1 Introduction
Tropical cyclones are rapidly rotating storm systems centered in a low-pressure region. Tropical cyclones cause heavy rain, strong wind, large storm surges near landfall, and tornadoes, resulting in loss of property and lives. About 1.9 million people have died because of tropical cyclones worldwide during the last two centuries [estimate2005robert, cropmanage]. The North Indian Ocean (which includes the Bay of Bengal and the Arabian Sea) alone has seen some of the most devastating tropical cyclones. In 2019, both coasts of India experienced substantial damage because of Cyclones Vayu and Fani.
Estimating the intensity of a tropical cyclone is of high importance. A standard indicator of the intensity of a storm is the maximum sustained surface wind speed (MSWS). The World Meteorological Organization categorizes low-pressure systems using ranges of the MSWS of tropical cyclones [mssw]. The categorization can be used to determine possible storm surges and damage impact on land [Victor2009Role].
In most tropical cyclone basins, the satellite-based Dvorak technique or reconnaissance aircraft are used to estimate MSWS [windspeed]. These techniques provide reasonable estimates but require advanced machinery. Therefore, estimating MSWS from other tropical cyclone parameters is a significant problem. Much work has been done on this problem; see [Chaudhuri12Intensity, detailed] and references therein for a complete history of the work relating to cyclone intensity prediction.
We propose a method to estimate MSWS based on other characteristics of a tropical cyclone, such as date, time, latitude, longitude, pressure drop, and estimated central pressure. We use machine learning algorithms to devise a regression model that estimates MSWS from these characteristics. We further employ machine learning classification algorithms to predict the grade of the cyclone based on the same characteristics.
2 Materials and methods
2.1 Data
The best track dataset of tropical cyclonic disturbances over the North Indian Ocean for the period 1990 to 2017, published by the Regional Specialized Meteorological Centre, New Delhi (http://www.rsmcnewdelhi.imd.gov.in/index.php?option=com_content&view=article&id=48&Itemid=194&lang=en), has been used in this study. The dataset provides the basin of origin, name (if any), date and time of occurrence, position (latitude and longitude), Class number (or T No.), estimated central pressure, MSWS, pressure drop, grade, outermost closed isobar, and diameter of the outermost closed isobar of each tropical cyclone. We define the terms used in the analysis below [mssw]:

Basin of origin (BOO): The Arabian Sea, the Bay of Bengal, or land; the possible basins of origin of a cyclone.

Date and Time: The date and time of the origin of the cyclone.

Latitude and Longitude: The latitude and longitude in degrees along the path of the cyclone.

Estimated central pressure (ECP): The surface pressure at the center of the tropical cyclone, as measured or estimated, in hPa (hectopascals).

Pressure drop (PD): The drop in pressure at the center of the cyclone with respect to the ambient atmospheric pressure, also measured in hPa.

Maximum sustained surface wind speed (MSWS): The highest 3-minute average of the surface wind speed occurring within the circulation of the system. It is measured in knots (nautical miles per hour); 1 knot equals 1.852 kilometres per hour.

Grade: Any tropical cyclone that develops within the North Indian Ocean is monitored by the India Meteorological Department (IMD). The tropical cyclone intensity scale, by cyclone category, is given in the following table:
Grade  Low pressure system                 MSWS (in knots)
1      Low Pressure Area (LP)              <17
2      Depression (D)                      17-27
3      Deep Depression (DD)                28-33
4      Cyclonic Storm (CS)                 34-47
5      Severe Cyclonic Storm (SCS)         48-63
6      Very Severe Cyclonic Storm (VSCS)   64-119
7      Super Cyclonic Storm (SS)           >=120
Table 1: The classification of low-pressure systems by IMD.
There is a total of 4852 instances of cyclone measurements in the dataset, out of which we selected 4021 for our study, dropping all instances with any missing feature value. A pictorial description of these cyclones, along with colour-coded grade, is shown in Figure 1. The date is divided into three classes according to three seasons: Pre-Monsoon (March to May), Monsoon (June to September), and Post-Monsoon (October to February). Time is divided into two categories, day and night. We did not include the outermost closed isobar and the diameter of the outermost closed isobar as attributes in our study, as very few data points were available in these columns. Table 2 describes the distribution of data in different categories.
Characteristics   Subdivisions                           Number of data points
Basin of Origin   Arabian Sea, Bay of Bengal, Land
Season            Pre-Monsoon, Monsoon, Post-Monsoon
Grade             1-7
Table 2: The distribution of data points in different categories.
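As an illustration (not the paper's code), the discretisation of date and time described above can be sketched in Python. The 06:00-18:00 day/night cut-off is our assumption, since the paper does not state the exact boundary:

```python
def season(month: int) -> str:
    """Map a calendar month (1-12) to the three seasons used in the study."""
    if 3 <= month <= 5:
        return "Pre-Monsoon"   # March to May
    if 6 <= month <= 9:
        return "Monsoon"       # June to September
    return "Post-Monsoon"      # October to February

def day_or_night(hour: int) -> str:
    """Coarse day/night split of the observation hour (cut-off assumed here)."""
    return "Day" if 6 <= hour < 18 else "Night"
```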
2.2 Methodology
The MSWS is a continuous variable, while the grade is a categorical variable. Therefore, we use various machine learning regression and classification algorithms (XGBoost, Gradient Boosting Machine, Linear Regression, Decision Tree, Random Forest, SVM, Naive Bayes, Logistic Regression) for the prediction of MSWS and grade. In what follows, we briefly describe these algorithms.
2.2.1 Decision tree
Decision Tree [10.1023/A:1022643204877] is one of the most popular supervised machine learning algorithms, used for both classification and regression. The model can be represented by an inverted tree with a root node at the top and other nodes connected to it through branches. Each node corresponds to a feature and a value assigned to that feature, while each branch represents a decision for the output variable based on the node it emanates from. To decide which feature to place at a node, we use measures like the Gini index, entropy, or information gain. For a set of samples $S$ with class proportions $p_i$ and a given attribute $A$:

Entropy is defined as
$$H(S) = -\sum_i p_i \log_2 p_i.$$

Information gain is defined as
$$IG(S, A) = H(S) - H(S \mid A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v).$$

Gini index is defined as
$$G(S) = 1 - \sum_i p_i^2,$$

where $p_i$ denotes the probability (relative frequency) of class $i$, $H(S)$ denotes the entropy of $S$, and $H(S \mid A)$ is the conditional entropy of $S$ for a particular instance of $A$, with $S_v$ the subset of $S$ on which $A$ takes value $v$. We can determine the importance of a given attribute of a feature vector by calculating one of the above measures for that attribute.
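A minimal sketch of the three split criteria above, written from their definitions (not taken from the paper's implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i log2 p_i over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """G(S) = 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
```

For a perfectly separating split of a balanced two-class set, the information gain equals the parent entropy of 1 bit.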
2.2.2 Random Forest
Random Forest [10.1023/A:1010933404324] is an ensemble learning method that can be used for both classification and regression. It generates multiple decision trees as part of the training process and outputs the mode (average) of these trees for classification (regression) problems. This approach mitigates the overfitting that is prevalent with single Decision Trees.
2.2.3 Gradient Boosting Machine
Gradient Boosting Machine [Friedman00greedyfunction] is an ensemble machine learning technique that is used for both classification and regression problems. It is based on the boosting technique: weak learners are added in an iterative manner, with each new learner fitted to the residual errors (negative gradients of the loss) of the current ensemble, so that the combination of weak learners is gradually converted into a strong learner.
2.2.4 XGBoost
XGBoost [DBLP:journals/corr/ChenG16] is one of the most popular recent scalable tree-boosting supervised machine learning algorithms, based on additive function approximation and several regularization techniques. It is used for both classification and regression problems. Let $\hat{y}_i$ be the outcome of the ensemble model, defined as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
where $\mathcal{F} = \{ f(x) = w_{q(x)} \}$ is the space of all regression trees, $q$ maps an input to a leaf index, $w$ is the vector of leaf weights, and $T$ denotes the total number of leaves in the tree. In the above equation, $f_k$ represents a regression tree and $f_k(x_i)$ is the outcome given by the $k$-th tree to the $i$-th entry in the data. The goal in XGBoost is to minimize the following regularized objective function:
$$\mathcal{L} = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),$$
where $l$ is the loss function. To avoid high complexity of the model, a regularization term $\Omega$ is used, which is given by
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,$$
where $\gamma$ and $\lambda$ are regularization parameters. The best split at any given node can be found from the following formula:
$$\mathcal{L}_{split} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$$
where $L$ stands for the left-hand node and $R$ stands for the right-hand node, and $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ are the sums of the first- and second-order gradients of the loss over the instances $I_j$ in node $j$. Figure 3 shows the XGBoost tree for the estimation of MSWS.
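The split-gain formula above can be sketched directly in Python; this is an illustrative transcription of the formula, not XGBoost's actual implementation:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """XGBoost-style gain of splitting a node into left/right children.

    g_*, h_* are the sums of first- and second-order loss gradients over
    the instances falling into each child; lam and gamma are the
    regularization parameters lambda and gamma from the objective.
    """
    def score(g, h):
        # Structure score G^2 / (H + lambda) of a single leaf.
        return g * g / (h + lam)

    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma
```

A split is kept only when this gain is positive, so a larger gamma prunes more aggressively.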
2.2.5 Linear Regression
In Linear Regression [Jeffrey2001Linear], a hyperplane is estimated that gives the best linear relationship between the independent variables (features) and the dependent variable (target). The prediction model (hypothesis) is given by:
$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n,$$
where $x$ represents the input vector and $\theta_0, \dots, \theta_n$ are the coefficients that determine the hyperplane. These coefficients are learned through an iterative process called gradient descent by minimizing the following loss function:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,$$
where $x^{(i)}$ denotes the $i$-th input vector and $y^{(i)}$ the corresponding target value.
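A minimal gradient-descent sketch for a one-feature model $y = wx + b$, written from the loss above (toy illustration, not the study's training code; learning rate and step count are arbitrary choices):

```python
def fit_linear(xs, ys, lr=0.05, steps=20000):
    """Gradient descent on the mean squared-error loss for y = w*x + b."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the squared-error loss w.r.t. w and b.
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

On data generated from y = 2x + 1 the routine recovers the slope and intercept to high precision.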
2.2.6 Logistic Regression
Logistic Regression [Walker1967Estimation] is a classifier that can be used to solve a multi-class prediction problem. It is an extension of Linear Regression, where the classification problem is converted into a regression problem by estimating the log-odds of each class in place of the probability itself. If $p_i$ denotes the probability of the $i$-th class, then the log-odds for this class is defined as
$$\log \frac{p_i}{1 - p_i}.$$
2.2.7 Support Vector Machines (SVM)
SVM [Cortes1995Support] can be used for both classification and regression problems. Like Linear Regression, SVM tries to find a separating hyperplane, but with maximum margin. The learning problem is converted into a convex quadratic optimization problem, subject to linear constraints. Solving this quadratic programming problem selects a few input vectors (called support vectors) that are then used for prediction. The case of input vectors that are not linearly separable can be handled with kernel techniques.
2.2.8 Naive Bayes
Naive Bayes [Rish01anempirical] can be used for both classification and regression problems. The Naive Bayes algorithm is based on Bayes' theorem with the assumption that the features are conditionally independent given the outcome. Suppose $x_1, \dots, x_n$ are real-valued attributes, and $Y$ is the set of all possible outcomes. According to Bayes' theorem,
$$P(Y = y_k \mid x_1, \dots, x_n) = \frac{P(Y = y_k) \, P(x_1, \dots, x_n \mid Y = y_k)}{\sum_j P(Y = y_j) \, P(x_1, \dots, x_n \mid Y = y_j)}.$$
If we assume that $x_1, \dots, x_n$ are conditionally independent for a given outcome $Y$, then the above equation can be written as
$$P(Y = y_k \mid x_1, \dots, x_n) = \frac{P(Y = y_k) \prod_i P(x_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(x_i \mid Y = y_j)}.$$
The above equation is used for the classification problem. Similarly, we can define Naive Bayes for regression problems, where the sum in the above equation is replaced by an integral.
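A toy count-based sketch of the classification rule above for categorical features, with Laplace smoothing added so unseen values do not zero out the product (the smoothing is our addition, not something the paper specifies):

```python
from collections import Counter

def nb_predict(X_train, y_train, x, alpha=1.0):
    """Return the class maximizing P(y) * prod_i P(x_i | y).

    X_train: list of tuples of categorical feature values.
    y_train: list of class labels.  alpha: Laplace smoothing constant.
    """
    classes = Counter(y_train)
    best, best_score = None, -1.0
    for c, nc in classes.items():
        score = nc / len(y_train)          # prior P(y = c)
        rows = [r for r, yy in zip(X_train, y_train) if yy == c]
        for i, v in enumerate(x):
            n_vals = len({r[i] for r in X_train})
            count = sum(1 for r in rows if r[i] == v)
            score *= (count + alpha) / (nc + alpha * n_vals)  # P(x_i | c)
        if score > best_score:
            best, best_score = c, score
    return best
```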
2.2.9 Metrics
To evaluate the performance of the regression models for MSWS, we use the Root Mean Square Error (RMSE) and the coefficient of determination ($R^2$).

RMSE: If there are $n$ sample points with $y_i$ as the actual value and $\hat{y}_i$ as the predicted value evaluated from the model, then RMSE is defined as
$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }.$$
RMSE is always non-negative and should be close to 0.

$R^2$: The coefficient of determination ($R^2$) is defined as
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}},$$
where the total sum of squares, $SS_{tot}$, is defined as $\sum_i (y_i - \bar{y})^2$ and the residual sum of squares, $SS_{res}$, is defined as $\sum_i (y_i - \hat{y}_i)^2$. Here, $\bar{y}$ is the mean of the data, $\bar{y} = \frac{1}{n} \sum_i y_i$.
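The two regression metrics can be transcribed directly from their definitions (illustrative sketch, not the study's evaluation code):

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted values."""
    n = len(y_true)
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot
```

A perfect fit gives RMSE 0 and R^2 equal to 1.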
The confusion matrix is used to determine the performance of the classification models on the test data. For classification models, multi-class classification accuracy has been measured using the confusion matrix [confusionmatrix]. Accuracy is the ratio of correctly predicted samples to all samples.
3 Results and Discussions
3.1 Correlation analysis
The correlation matrix of all variables is given in Figure 2. The grade is weakly correlated with all the variables except ECP and PD. Also, the correlation of grade with ECP is negative, suggesting that when the central pressure is low, the intensity of the cyclone is high. MSWS shares a similar correlation with ECP as the grade. This is not surprising, as the grade is directly evaluated from MSWS; see Table 1. PD has a strong positive correlation with MSWS. A linear regression suggests a linear relationship between MSWS and PD in the North Indian Ocean. Notice that in [Rosendal982Relationship], a similar relationship between MSWS and PD was reported for tropical cyclones in the Central North Pacific Ocean.
3.2 Model selection and validation
We use 10-fold cross-validation for each of the models. In each fold, we split the data into training and validation sets. Then, each ML algorithm is applied to the training set to train the model. At every step, the performances (RMSE, $R^2$, or accuracy) of the model are recorded, and the average of each of these performances is reported in Tables 2(a) and 2(b).
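The k-fold splitting procedure can be sketched as follows (a generic index-based sketch, not the study's code; in practice a library routine such as scikit-learn's KFold would typically be used):

```python
def kfold_indices(n, k=10):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation
    over a dataset of n samples, without shuffling."""
    # Distribute the remainder so every sample lands in exactly one fold.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

Each of the k folds serves once as the validation set while the remaining folds form the training set, and the reported metric is averaged over the k runs.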


It is evident from Table 2(a) that XGBoost outperforms the other models, with an RMSE of 2.3 and $R^2$ of 0.99. Notice from Table 1 that the range of MSWS values for a particular grade is always at least 5, and since XGBoost predicts MSWS with an RMSE of 2.3, we expect that XGBoost will also predict the grade with very high accuracy. That is indeed the case: from Table 2(b), XGBoost has an accuracy of 87.15% in predicting the grade. However, the Decision Tree with entropy criterion and depth 4 outperforms XGBoost in predicting the grade, with an accuracy of 87.91%.
Moreover, if we fix the classification model for the grade to be the Decision Tree with entropy criterion and depth 4, Table 3 reports the accuracy in predicting each category of the grade. The model predicts the top three high-intensity categories (SCS, VSCS, and SS) of grade with an average accuracy of 98.84%.
Category   Accuracy (%)
LP         98.33
D          77.92
DD         78.37
CS         88.72
SCS        100
VSCS       99.51
SS         97
Table 3: Category-wise accuracy of the Decision Tree (entropy, depth 4) classifier.
3.3 Testing on Vayu and Fani
We test our model on two recent tropical cyclones, Vayu and Fani. Vayu was a grade 7 tropical cyclone that hit the Indian west coast in June 2019. Around 6.6 million people in northwestern India were affected by the cyclone [VAYU]. Fani was also a grade 7 tropical cyclone that hit the Indian state of Odisha in April-May 2019. Due to Fani, India and Bangladesh suffered heavy damage. At least 89 people were reported dead, and the damage was estimated at around US$8.1 billion [FANI].
We checked the performance of the best model for predicting MSWS, XGBoost, on Vayu and Fani. The RMSE is 2.2 and 3.4, while $R^2$ is 0.99 and 0.99, for Vayu and Fani, respectively. Figure 4 depicts the actual values of MSWS and the values predicted by the XGBoost model during the course of Vayu and Fani.
4 Conclusion
Estimating the intensity of tropical cyclones on a real-time basis is a problem worth studying, considering the human life and economic losses involved. In this study, we explored various machine learning techniques and reported their performance in estimating the maximum sustained surface wind speed and intensity of tropical cyclones. Our research finds that the ML models XGBoost and Decision Tree can be used to estimate MSWS and intensity with excellent performance over the North Indian Ocean.
Acknowledgement
The authors are thankful to the India Meteorological Department (IMD) for providing the data archives.
Conflict of Interest
All the authors declare that they have no conflict of interest.