Wildland fire is considered as one of the most destructive events for both human and nature, especially in the western states of the U.S. It poses a great threat to many aspects of our life, ranging from poor air quality and loss of habitation to the damage of lots of valuable assets in power system infrastructure. The frequency of wildfires has increased by a factor of four compared to 1970 which largely owes to climate change. The worst sufferer, California, is experiencing some of the deadliest wildfires and its aftermath in the recent times. In 2020 alone, 4.2 million acres were burnt due to wildfire in CA which is roughly 4% of the total area of the state . According to Cal Fire, last year the CZU fire in San Mateo and Santa Cruz county destroyed more than 1500 structures and left around 40,000 customers in darkness for consecutive days [czu_fire]
. Back in 2018, the Camp Fire killed 84 people, caused an estimated $9.3 billion in residential property damage alone and ultimately lead the responsible utility Pacific Gas & Electricity (PG&E) to file for bankruptcy[camp_loss]. Apart from CA, we have seen devastating fires in Bastrop county, Texas that was initiated from vegetation contact with power lines and caused more than $300 million in damage [texas].
Due to its adverse effects on stakeholders and environment, wildfire has become a major concern for both the utility companies and forest officials as well as to the members in the research community. Researchers from diverse fields have investigated on its various aspects that include wildfire prediction, monitoring, controlling techniques, vegetation management and long term protection planning. In this paper, we will limit our discussion on the prediction task. Preisler et al. produced a map of potential fire distribution of the whole USA using gridded satellite data and surface observation 
. Later they performed a spatially explicit forecast for modelling the the probability of a large wildland fire. Both of the papers have used remote sensing data to make the prediction in conventional approach. Gasull et al. utilized wireless sensor networks to predict the risk of a wildfire but this is not an effective way in terms of both cost and precision. 
. Machine learning (ML) techniques have also been used in predicting the occurences of fire. One of the recent works by Nadeem et al. used a lasso-logistic framework for human and lightning caused wildland fire occurence. Sayad et al. have shared a new dataset consisting of some of the underlying features causing wildfire and used some baseline ML algorithms for predicting fire locations . None of these works has considered all the important aspects together that are responsible for initiating a fire. At the same time, it requires some deep learning techniques to identify the complex pattern between a fire ignition and its corresponding factors. Deep learning is gaining much popularity due to it’s supremacy in terms of accuracy when trained with huge amount of data.
In this paper, we have investigated the underlying relation between a wildland fire and its corresponding factors to predict the fire locations one day ahead. We have also checked the performance of our proposed method on another dataset used by  in order to prove the efficacy of our approach. The contribution of this paper is summarized as follows:
Constructing a data set consisting of the relevant features (meteorological factors, vegetation status, topology and proximity to nearby power lines) combined together to correctly identify the causation of a wildfire in a holistic manner.
Employing a very useful state-of-the-art deep learning technique, generative. adversarial network (GAN) for conditional generation of tabular data to generate sufficient inputs from the original dataset for properly train ML models. To the best of our knowledge, no one has implemented a GAN based technique in a dataset that has a wide variety of fire risk elevating factors that gives a more accurate prediction of wildfire occurrences.
The remainder of this paper is organized as follows. Section II presents the steps towards building the data set. In Section III, the CTGAN architecture is discussed. Simulation studies are presented in Section IV. Section V concludes the paper with the major findings and future work.
Data Set Construction
Set of features
For predicting fire locations, we first need to analyze the complex dynamics between different contributing factors and how they lead to a fire ignition. Discovering that part is not an easy task as there are a lot of reasons by which a fire can be initiated. So, in order to model the relationship, we have decided to build a dataset from scratch consisting of a wide variety of features. Among them, weather or meteorological factors have a direct, strong causation. Also, it is evident from the fact that over the past few decades, climate change has given rise to the frequency of all sizes of fires, large, small and big, all over the world. We have collected the data for temperature, precipitation, surface pressure, direction and speed of wind and humidity of a specific geographical location. Besides that, naturally occurring wildfires can easily be sparked by dry weather and droughts since dry vegetation acts as a flammable fuel while warm temperature encourages combustion. Therefore, vegetation indices play a prominent role in determining fire hotspots. We have gathered the data for 2 such indices, which are normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI). Both weather and vegetation data can be acquired from weather stations placed in certain locations. However, a big issue of station based data is that the value of a particular weather variable is assumed constant over a wide range of areas under the coverage of that station. Hence, precise data may not be available because we do not have so many stations for reasonably spaced locations. Installing lots of sensors can be a solution that is too expensive for many remote rural areas. By contrast, utilizing satellite images facilitate this purpose by maintaining a good trade-off between precision and cost.
Another important component that influences the fire behaviour of a region is the topography. It is the physical feature of a place which is generally static (unless changed by men or some natural disasters like hurricane, tornado etc.) and is opposite of weather which is ever changing. However, it interacts with wind speed and direction interestingly and thus eventually impacts fire initiation. In the slopes, the wind usually blow up during the day and less dense air (reflected by the surface) rises up along the slope. Hence, the steeper the slope, the quicker the hot air will flow uphill and preheat the flammable fuel to its ignition temperature. Thus the uphill zones are more prone to fire ignition during the daytime. The scenario is opposite at night when the cooler winds start to flow downhill. For this reason, we have incorporated the topographic information of the region in our dataset.
Last but not the least, one of the common causes of wildfire ignition is electrical equipment and power line failures. High temperature arcing created during a high impedance fault can ignite a proximate vegetation and other combustible materials. Other than that during high wind, trees and branches may fall across power lines and can result in two conductors coming in contact with each other. It produces high energy arcing and ejects hot metal particles that can eventually lead to a wildfire ignition. Considering this, we have included the distance between fire locations and the nearest power lines in our dataset to account for the chances of fire ignition due to power system equipment failure.
The complete set of features used in constructing our dataset is shown in Figure id1.
Two most destructive wildfires in CA, ’Camp’ and ’Tubbs’, respectively occurred in Butte county, in November 2018 and in Napa, Sonoma county in October, 2017 were used as the test case for our dataset. To start with, we collected the location of the fire points (latitude-longitude) from the earth observation data by NASA that used MODIS- Terra (MOD14) product to locate the fire pixel . After that we projected those geographic points in QGIS and added some other points those are not on fire. In the dataset, we labeled these points as ’No fire’ while the points derived from  are labeled as ’Fire’. In total, there are 849 data points among which 286 hold the ’Fire’ label while the rest 314 are labeled as ’No fire’. As we are making a day ahead forecast, satellite images for the previous day containing the weather parameters i.e. temperature, specific humidity, precipitable water vapor, surface pressure, eastward and north wind, were collected from MERRA-2 (Modern-Era Retrospective analysis for Research and Applications version 2)  for both the ’Fire’ and ’No fire’ points. The vegetation indices for the same locations were collected from USGS that used Landsat 8 satellite, and having a 30m spatial resolution . Digital elevation model (DEM) was used to get the topographic data of the points from USGS as well . Finally, the information regarding the location of electric transmission lines were acquired from California Energy Commission .
The next step was to process the satellite images (in .tif format) and extract the pixel values to get different weather features (the same applies for vegetation indices and DEM). We used the ‘rgdal’ library of RStudio for image processing and value extraction. The Landsat-8 images for vegetation has 9 spectral bands among which band 2 (Visible Blue), band 4 (Visible Red) and band 5 (Near InfraRed or NIR) was used to extract NDVI and EVI. The NDVI and EVI are calculated from these individual measurements according to the following equation:
Prediction Using CTGAN
Traditional ML approaches like decision trees, clustering algorithms like support vector machines usually provide simple solutions to many classification problems. However, detangling the complex dynamics between wildfire occurrence and its underlying factors is not an easy task which requires more sophisticated techniques loaded with more amount of better quality data. Deep learning (DL) is one among them where the ‘learning’ part happens in the hidden layers of a neural network and ‘deep’ refers to the number of those layers. As the ‘depth’ of the network increases, the more insights about the prediction analysis can be found. One of the biggest hurdles towards utilizing these models is the limited number of data that we have to feed into the network. As, in this case, we are using remote sensing data, there is no simple way to get more data from a high enough resolution source for a given location within a specific time. Hence, in order to enlarge the dataset, we aimed to generate synthetic data from the original set.
Generative adversarial network (GAN) is a DL based supervised learning approach that discovers and learns the pattern in a dataset by its own and produces new data that can be reasonably drawn from the original dataset. There are two sub-networks in a basic GAN model, generator and discriminator. The generator, as the name implies, generates new samples and the discriminator tries to identify correctly whether that sample is coming from the generator or from the original set. When the generator is clever enough to fool the discriminator by the synthetic data, the training phase is complete and the network can successfully generate fake data as close as real data.
There are different types of GANs used for a wide variety of tasks. Radford et al. proposed Deep convolutional GAN (DCGAN) which is an improved version of GANs and used in generating high quality images [dcgan]. InfoGAN is an information-theoretic extension to the GAN that is able to learn disentangled representations in an unsupervised manner 
. WGAN is another variant that uses Wassertain distance in the loss function. Another important adaptation is conditional generative adversarial networks (CGAN) that uses an extra label information as an input to condition on both the generator and discriminator and can generate new samples conditioned on class labels .
Although GAN offers a great flexibility in modeling the distribution of image data, there are certain challenges when it comes to model tabular format of data containing a mixture of discrete and continuous columns. Firstly, in images, the pixel values usually follow a Gaussian distribution which can be normalized and modeled using a min-max transformation. But, values, specifically continuous values, in tabular data do not follow such distribution and the min-max transformation might lead to a vanishing gradient problem. Secondly, the continuous columns might have multimodal distribution. Srivastava et al. showed that the vanilla version of GAN cannot model all modes on a simple tabular dataset. Hence, modeling a continuous column is not an easy task.
To overcome these challenges, Xu et al. came up with the CTGAN (conditional tabular GAN) model that uses mode-specific normalization to combat the non-Gaussian and multimodal distribution of continuous columns and a conditional generator to accurately model the discrete columns in a tabular dataset . For implementing the mode-spec ific normalization, the steps are as follows:
is processed individually using a variational Gaussian mixture model (VGM) to estimate the number of modes and fit a Gaussian mixture. For example, let us assumemodes are estimated by VGM which are . The learned GMM can be expressed as for each value in .
Then, for each in that column, the probability of that value coming from each mode is calculated which is expressed as .
Finally, a mode is sampled given the probability densities and the value is normalized using the sampled mode. The final value of a row in a continuous column is represented by a one-hot encoded vector representing the mode it belongs to and a scalar representing the value within that mode. Lets say, we sample the second mode given. Then the one-hot vector representing the row of column is indicating the second mode and the scalar = representing the value.
In general, any row can be represented as
where is the number of continuous columns and denotes the concatenation operator. This mode specific technique ensures that the generated synthetic data () follows the same distribution as the original data (
) . The authors also verified that a classifier trained by () achieved similar performance on () as a classifier learned on ().
Now, for discrete columns, a conditional generation process is implemented which means new sample generation based on specific category of discrete columns i.e. where is generated sample, is the learnt distribution, is the discrete column and is one of the categories in that discrete column. The conditional generator must learn the real data conditional distribution well enough which means . This is ensured by implementing a conditional vector, training by sampling method and minimizing the generator loss. The detail description of these can be found in .
The overall CTGAN model is depicted in Figure id1 considering number of continuous columns and discrete columns. In our case, we do not have any categorical features, hence the only discrete column is the label indicating fire or no fire class. In the figure, it is assumed that the category in the first discrete column is selected as the condition to generate .
Network architecture of CTGAN
Both the generator and the discriminator are fully connected neural networks capturing all possible correlation between all the columns. Both of them has 2 hidden layers each. Relu is used as the activation function in the generator whereas leaky relu is used in the discriminator. There is another synthetic row representation layer in the generator after the 2 hidden layers. The scalar value in that layeris generated by tanh function while the mode indicator is generated by gumbel softmax.
The original dataset built in section id1 is first divided into training and test set in a 70:30 ratio. Then the training set is used as the input of the CTGAN model. The augmented set i.e. generated synthetic data as the output from the CTGAN model along with the original training set are then used to train the ML models. The efficacy of these models are evaluated using the untouched original test set. This segmentation is depicted in figure Figure id1.
To show the efficacy of our proposed model, we have compared it with 5 other baseline algorithms- decision tree (DT), random forest (RF), gradient boosting (GB), support vector machine (SVM) and a feed forward neural network (NN). DT, RF, GB and SVM were implemented using Scikit-Learn 0.23.2 and NN was trained using Tensorflow v2.0. We also checked the performance of the proposed approach in another data set used by.
In our task, we tuned the hyperparameters of the benchmark models along with our proposed approach in order to get the best result. GridSreachCV was used for the tuning and the set of hyperparameters are given in Table I. The bold values represent the best ones those were later used for training the model.
|Random forest||Number of Estimators (50,100, 200, 600)|
|Maximum depth of each tree (5,10,15,20)|
|Minimum samples split (1,2,3,5)|
|Minimum samples split (1,3,5,10, 15)|
|Support vector machine||Kernel (RBF, linear, polynomial)|
|gamma (0.1 1,10)|
|Neural network||Epochs (50,100,200,300)|
|Batch size (2,8,16, 32,64)|
number of neurons in each layer (5,10,20,
|Learning rate (0.1,0.01,0.001, 0.0001)|
|Dropout rate (0, 0,1 ,0.2))|
Test accuracy is used as the performance metric to evaluate our proposed method. Figure id1 shows the result. The blue bars represent the test accuracy for applying the baseline models while the red bars represent the same when the baseline models were trained with the original training set along with the synthetic data generated from the CTGAN model. Both of the methods (Baseline and CTGAN + Baseline) were tested with the same test set. We can see that except DT and SVM, the remaining 3 models using the augmented dataset by CTGAN outperform their baseline counterparts. The best result is achieved for NN having a test accuracy of 79.31%.
As mentioned earlier, we evaluated the performance of the proposed approach in another dataset introduced by Sayad et al. . That dataset also has 2 classes- fire and no fire but has only 3 features (burnt area, LST, NDVI). The result is presented in Figure id1. It is evident that all the algorithms except RF, produce better result when trained with an augmented dataset.
In this paper, we have done a day ahead prediction of fire locations by analyzing the relation between various underlying factors of fire ignition. First, we have built a dataset consisting of meteorological features, vegetation indices, topology and power line distances. Then a deep learning technique, CTGAN is used to produce enough good quality data in order to obtain a better prediction using ML algorithms. A comparative analysis is performed between 5 baseline algorithms and our proposed method in which the CTGAN combined with neural network yielded the best result. We have also applied our proposed approach in another dataset that further reinforced the efficacy of our method. In future, we aim to incorporate more factors like human activities, wildland-urban interface (WUI) on the prospect of increased rate of migration to make a more robust and wholesome model for wildfire prediction.
-  () . Note: https://nationalmap.gov/downloader Cited by: p14.
-  (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 214–223. Cited by: p19.
-  (2016-Dec.) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2180–2188. Cited by: p19.
-  () . Note: https://hub.arcgis.com/datasets Cited by: p14.
-  (2021-Jan.) 280,000 customers face losing power in California amid fire-safety blackouts. Note: https://www.forbes.com/sites/roberthart/2021/01/19 Cited by: p7.
-  (2011) Computational intelligence applied to wildfire prediction using wireless sensor networks. In Proceedings of the International Conference on, Vol. , pp. 1–8. External Links: Cited by: p8.
-  (2018) Wildfire mapping in interior alaska using deep neural networks on imbalanced datasets. In Proceedings of the 2018 IEEE International Conference on Data Mining Workshops,
-  () . Note: https://disc.gsfc.nasa.gov/datasets Cited by: p14.
-  (2014) Conditional generative adversarial nets. In arXiv preprint arXiv:1411.1784, Cited by: p19.
-  (2019-Jan.) Mesoscale spatiotemporal predictive models of daily human- and lightning-caused wildland fire occurrence in British Columbia. International Journal of Wildland Fire 29 (), pp. 11–27. Cited by: p8.
-  () Active fire data. Note: https://earthdata.nasa.gov/earth-observation-data/near-real-time/firms/active-fire-data Cited by: p14.
-  (2013-22 November) Global wildfires, carbon emissions and the changing climate. Note: FDI SAP
-  (2009) Forecasting distributions of large federal-lands fires utilizing satellite and gridded weather information. International Journal of Wildland Fire 18 (), pp. 508–516. Cited by: p8.
-  (2011) Spatially explicit forecasts of large wildland fire probability and suppression costs for California. International Journal of Wildland Fire 20 (), pp. 508–517. Cited by: p8.
-  (2020-Aug.) SAR-enhanced mapping of live fuel moisture content. Remote Sens. Environ. 245 (111797), pp. .
-  (2019) Predictive modeling of wildfires: a new dataset and machine learning approach. Fire Safety Journal 104 (), pp. 130–146. Cited by: Figure , p27, p30, p8, p9.
-  (2017-Dec.) VEEGAN: reducing mode collapse in GANs using implicit variational learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3310–3320. Cited by: p20.
-  () . Note: https://earthexplorer.usgs.gov/ Cited by: p14.
-  (2019-Oct.) Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems, Cited by: p21, p23.