Tropical cyclone intensity estimations over the Indian ocean using Machine Learning

by   Koushik Biswas, et al.
IIIT Delhi

Tropical cyclones are among the most powerful and destructive natural phenomena on Earth. Tropical storms and heavy rains can cause floods, which lead to loss of human life and economic loss. Devastating winds accompanying cyclones heavily affect not only the coastal regions but also distant areas. Our study focuses on intensity estimation, particularly the cyclone grade and maximum sustained surface wind speed (MSWS), of tropical cyclones over the North Indian Ocean. We use various machine learning algorithms to estimate cyclone grade and MSWS. We use the basin of origin, date, time, latitude, longitude, estimated central pressure, and pressure drop as attributes of our models. We use multi-class classification models for the categorical outcome variable, cyclone grade, and regression models for MSWS, as it is a continuous variable. Using the best track data of 28 years over the North Indian Ocean, we estimate grade with an accuracy of 88% and MSWS with a root mean square error (RMSE) of 2.3. For higher grade categories (5-7), accuracy improves to an average of 98.84%. We tested our model on two recent tropical cyclones in the North Indian Ocean, Vayu and Fani. For grade, we obtained an accuracy of 93.22% and 95.23%, respectively, while for MSWS we obtained an RMSE of 2.2 and 3.4 and an $R^2$ of 0.99 and 0.99, respectively.



1 Introduction

Tropical cyclones are rapidly rotating storm systems centered in a low-pressure region. Tropical cyclones cause heavy rain, strong wind, large storm surges near landfall, and tornadoes, which result in loss of property and lives. About 1.9 million people have died because of tropical cyclones worldwide during the last two centuries [estimate2005robert, cropmanage]. The North Indian Ocean (which includes the Bay of Bengal and the Arabian Sea) alone has seen some of the most devastating tropical cyclones. In 2019, both coasts of India experienced substantial damage because of Vayu and Fani.

It is of high importance to estimate the intensity of a tropical cyclone. A standard indicator of the intensity of the storm is the maximum sustained surface wind speed (MSWS). The World Meteorological Organization categorizes the low-pressure systems using the ranges of MSWS of the tropical cyclones [mssw]. The categorization can be used to determine possible storm surges and damage impact on land [Victor2009Role].

In most tropical cyclone basins, the satellite-based Dvorak technique or reconnaissance aircraft are used to estimate MSWS [windspeed]. These techniques provide reasonable estimates but require advanced machinery. Therefore, estimating MSWS from other tropical cyclone parameters is a significant problem. Much work has been done on this problem; see [Chaudhuri12Intensity, detailed] and references therein for a complete history of the work relating to cyclone intensity prediction.

We propose a method to estimate MSWS based on other characteristics of a tropical cyclone like date, time, latitude, longitude, pressure drop and estimated central pressure. We use machine learning algorithms to devise a regression model to estimate MSWS from other characteristics. We further employ machine learning classification algorithms to predict the grade of the cyclone based on these characteristics.

2 Materials and methods

2.1 Data

The best track dataset of tropical cyclonic disturbances, collected from the Regional Specialized Meteorological Centre, New Delhi, has been used in this study for the period from 1990 to 2017 in the North Indian Ocean. The basin of origin, name (if any), date and time of occurrence, position (latitude and longitude), Class number (or T No.), estimated central pressure, MSWS, pressure drop, grade, outermost closed isobar and diameter of outermost closed isobar of tropical cyclones are provided in the dataset. We define the terms used in the analysis below [mssw]:

  • Basin of origin (BOO): The Arabian Sea, the Bay of Bengal, or land are the possible basins of origin of any cyclone.

  • Date and Time: The date and time of the origin of the cyclone.

  • Latitude and Longitude: The latitude and longitude in degrees along the path of the cyclone.

  • Estimated central pressure (ECP): It is the surface pressure at the center of the tropical cyclone as measured or estimated, in hectopascals (hPa).

  • Pressure drop (PD): It is the drop in the pressure with respect to the atmospheric pressure. It is also measured in hPa.

  • Maximum sustained surface wind speed (MSWS): The maximum sustained surface wind speed is the highest 3-minute average surface wind speed occurring within the circulation of the system. It is measured in knots (nautical miles per hour); one knot equals 1.852 kilometers per hour.

  • Grade: Any tropical cyclone that develops within the North Indian Ocean between 45°E and 100°E is monitored by the India Meteorological Department (IMD). The tropical cyclone intensity scale according to cyclone category is given in the following table:

    Grade  Low pressure system                 MSWS (in knots)
    1      Low Pressure Area (LP)              < 17
    2      Depression (D)                      17-27
    3      Deep Depression (DD)                28-33
    4      Cyclonic Storm (CS)                 34-47
    5      Severe Cyclonic Storm (SCS)         48-63
    6      Very Severe Cyclonic Storm (VSCS)   64-119
    7      Super Cyclonic Storm (SS)           ≥ 120
    Table 1: The classification of the low pressure systems by IMD.
Figure 1: Cyclones hitting India from 1990 to 2017.
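The grade bands of Table 1 and the knot conversion can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and assigning non-integer wind speeds that fall between two bands (for example 27.5 knots) to the lower grade is an assumption, since the table lists integer ranges.

```python
from bisect import bisect_right

# One knot is one nautical mile (1.852 km) per hour.
KNOT_IN_KMH = 1.852

# Lower bounds (in knots) of grades 2 to 7, read off Table 1.
GRADE_LOWER_BOUNDS = [17, 28, 34, 48, 64, 120]


def knots_to_kmh(speed_knots: float) -> float:
    """Convert a wind speed from knots to kilometers per hour."""
    return speed_knots * KNOT_IN_KMH


def grade_from_msws(msws_knots: float) -> int:
    """Return the grade (1 to 7) of Table 1 for an MSWS given in knots."""
    # bisect_right counts how many lower bounds are <= msws_knots,
    # so a value below 17 knots yields grade 1.
    return 1 + bisect_right(GRADE_LOWER_BOUNDS, msws_knots)
```

For example, `grade_from_msws(50)` falls in the 48-63 band and returns grade 5 (Severe Cyclonic Storm).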

There are a total of 4852 instances of cyclone measurements in the dataset, out of which we selected 4021 for our study, dropping all instances with any missing feature value. A pictorial description of these cyclones, along with colour-coded grade, is shown in Figure 1. The date is divided into three classes according to three seasons: pre-monsoon (March to May), monsoon (June to September), and post-monsoon (October to February). Time is divided into two categories according to day and night. We did not include outermost closed isobar and diameter of outermost closed isobar as attributes in our study, as very few data points were available in these columns. Table 2 describes the distribution of data in different categories.

Characteristics    Subdivisions                        Number of data points
Basin of Origin    Arabian Sea
                   Bay of Bengal
                   Pre-Monsoon (March - May)
                   Monsoon (June - September)
                   Post-Monsoon (October - February)
Table 2: Baseline Data.
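The preprocessing described above (dropping incomplete records, mapping the date to a season, and mapping the time to day or night) might look roughly as follows. The season boundaries come from the text; the day/night cut-off hours and all function and field names are assumptions made for this sketch, since the source does not state them.

```python
def season_of(month: int) -> str:
    """Map a calendar month (1 to 12) to one of the three seasons in the text."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    if 3 <= month <= 5:
        return "pre-monsoon"
    if 6 <= month <= 9:
        return "monsoon"
    return "post-monsoon"  # October to February


def part_of_day(hour: int, day_start: int = 6, day_end: int = 18) -> str:
    """Label an hour as day or night.

    ASSUMPTION: the 06:00 and 18:00 boundaries are hypothetical defaults;
    the source only says that time is split into day and night.
    """
    return "day" if day_start <= hour < day_end else "night"


def drop_incomplete(records: list) -> list:
    """Keep only the records (dicts) in which no feature value is missing."""
    return [r for r in records if all(v is not None for v in r.values())]
```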

2.2 Methodology

The MSWS is a continuous variable, while the grade is a categorical variable. Therefore, we use various machine learning regression and classification algorithms (XGBoost, Gradient Boosting Machine, Linear Regression, Decision Tree, Random Forest, SVM, Naive Bayes, Logistic Regression) for the prediction of MSWS and grade. In what follows, we briefly describe these algorithms.

2.2.1 Decision tree

Decision Tree [10.1023/A:1022643204877] is one of the most popular supervised machine learning algorithms, used for both classification and regression. The algorithm can be represented by an inverted tree with a root node at the top and other nodes connected to it through branches. Each node corresponds to a feature and a value assigned to the feature, while each branch represents a decision taken for the output variable based on the node from which it emanates. To decide which feature to place at a node, we use measures like the Gini index, entropy, or information gain. For a given attribute $A$ and outcome variable $Y$,

  • Entropy is defined as $H(Y) = -\sum_{y} p(y) \log_2 p(y)$.

  • Information gain is defined as $IG(Y, A) = H(Y) - \sum_{a} p(a)\, H(Y \mid A = a)$.

  • Gini index is defined as $G(Y) = 1 - \sum_{y} p(y)^2$.

Here $p$ denotes the probability, $H(Y)$ denotes the entropy and $H(Y \mid A = a)$ is the conditional entropy for a particular instance $a$ of $A$. We can determine the importance of a given attribute of a feature vector by calculating one of the above for that attribute.
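The three impurity measures can be computed directly from label counts. The sketch below uses the standard textbook definitions with base-2 logarithms; the function names are illustrative and do not come from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: one minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the samples by the value of one attribute."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute_values, labels):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

An attribute that separates the classes perfectly has an information gain equal to the entropy of the labels, while an attribute unrelated to the labels has a gain of zero.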

2.2.2 Random Forest

Random forest [10.1023/A:1010933404324] is an ensemble learning method that can be used for both classification and regression. It generates multiple decision trees as part of the training process and outputs the mode of their predictions for classification or the average of their predictions for regression. This approach reduces the overfitting that is prevalent in the case of single Decision Trees.
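The aggregation step of a random forest can be sketched as follows, assuming the individual tree predictions are already available. The function names and the example values are hypothetical.

```python
from statistics import mean, mode


def aggregate_classification(tree_votes):
    """Majority vote: the most common class predicted by the trees."""
    return mode(tree_votes)


def aggregate_regression(tree_outputs):
    """Average of the numeric predictions of the trees."""
    return mean(tree_outputs)
```

For instance, three trees voting grades 4, 5 and 4 yield grade 4, and three trees predicting 40, 44 and 42 knots yield 42 knots.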

2.2.3 Gradient Boosting Machine

Gradient Boosting Machine [Friedman00greedyfunction] is an ensemble machine learning technique that is used for both classification and regression problems. It relies on boosting: weak learners are added one at a time, each fitted to the errors (the negative gradient of the loss) of the current ensemble, so that their combination becomes a strong learner in an iterative manner.
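A minimal sketch of boosting for squared-error loss is given below. In this common formulation each new weak learner is fitted to the residuals of the current ensemble. The restriction to a single input feature, the use of depth-one trees (stumps), the hyperparameter defaults and all names are assumptions made for illustration, not the configuration used in the study.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error.

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predictor built from n_rounds stumps fitted to residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict
```

With a learning rate of 0.1, each round removes roughly a tenth of the remaining error on data that a single stump can separate, which is why many rounds are needed.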

2.2.4 XGBoost

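XGBoost [DBLP:journals/corr/ChenG16] is one of the most popular recent supervised learning algorithms. It is a scalable tree boosting method based on function approximation and several regularization techniques, and it is used for both classification and regression problems. Let $\hat{y}_i$ be the outcome from the ensemble model, defined as follows:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $\mathcal{F} = \{ f(x) = w_{q(x)} \}$, $q : \mathbb{R}^m \to \{1, \dots, T\}$, $w \in \mathbb{R}^T$, is the space of all regression trees and $T$ denotes the total number of leaves in the tree. In the above equation, $f_k$ represents a regression tree and $f_k(x_i)$ is the outcome given by the $k$th tree to the $i$th entry in the data. The goal in XGBoost is to minimize the following regularized objective function:

$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),$$

where $l$ is the loss function. To avoid high complexity of the model, a regularization term $\Omega$ is used, which is given by

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2,$$

where $\gamma$ and $\lambda$ are regularization parameters. The best split at any given node can be found from the following formula:

$$\mathcal{L}_{\text{split}} = \frac{1}{2} \left[ \frac{\left( \sum_{i \in I_L} g_i \right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left( \sum_{i \in I_R} g_i \right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma,$$

where $g_i$ and $h_i$ are the first and second derivatives of the loss $l$ with respect to the prediction for the $i$th entry, $I_L$ stands for the left-hand node and $I_R$ stands for the right-hand node, letting $I = I_L \cup I_R$. Figure 3 shows the XGBoost tree for the estimation of MSWS.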
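The split criterion of the XGBoost paper scores a candidate split as half of $G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - G^2/(H+\lambda)$, minus $\gamma$, where $G$ and $H$ are sums of first and second derivatives of the loss over the instances in a node. The sketch below evaluates that expression; the function name and the default parameter values are illustrative assumptions.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into a left and a right child.

    g_left, g_right: sums of first derivatives of the loss in each child.
    h_left, h_right: sums of second derivatives of the loss in each child.
    lam, gamma: the regularization parameters lambda and gamma.
    """
    def score(g, h):
        return g * g / (h + lam)

    g_total = g_left + g_right
    h_total = h_left + h_right
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_total, h_total)) - gamma
```

For the squared-error loss $\frac{1}{2}(\hat{y}_i - y_i)^2$, the first derivative is $\hat{y}_i - y_i$ and the second derivative is 1, so the sums of second derivatives are simply the numbers of instances in each child. A split is worthwhile only when the gain is positive, which is how $\gamma$ acts as a pruning threshold.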

2.2.5 Linear Regression

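In Linear Regression [Jeffrey2001Linear], a hyperplane is estimated that gives the best linear relationship between the independent variables (features) and the dependent variable (target). The prediction model (hypothesis) is given by

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n,$$

where $x = (x_1, \dots, x_n)$ represents the input vector and $\theta_0, \dots, \theta_n$ are the coefficients that determine the hyperplane. These coefficients are learned through an iterative process called gradient descent by minimizing the following loss function over $m$ training samples:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,$$

where $x^{(i)}$ denotes the $i$th input vector and $y^{(i)}$ the corresponding target value.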
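Gradient descent on the mean squared error can be sketched as follows. The loss is taken to be the usual $\frac{1}{2m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})^2$; the learning rate, the iteration count and the function name are assumptions chosen so that the toy example converges, not values from the study.

```python
def fit_linear_regression(X, y, learning_rate=0.05, n_iters=5000):
    """Fit theta by batch gradient descent; theta[0] is the intercept."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(n_iters):
        # Prediction error of the current hypothesis on every sample.
        errors = [
            theta[0] + sum(t * v for t, v in zip(theta[1:], row)) - target
            for row, target in zip(X, y)
        ]
        # Partial derivatives of the loss with respect to each coefficient.
        grad0 = sum(errors) / m
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j in range(n)]
        theta[0] -= learning_rate * grad0
        for j in range(n):
            theta[j + 1] -= learning_rate * grads[j]
    return theta
```

On data generated exactly by y = 1 + 2x, the returned coefficients approach an intercept of 1 and a slope of 2.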

2.2.6 Logistic Regression

Logistic regression [Walker1967Estimation] is a classifier that can be used to solve a multiclass prediction problem. It is an extension of Linear Regression, where the classification problem is converted into a regression problem by estimating the log(odds) of each class in place of the probability itself. If $p_k$ denotes the probability of the $k$th class, then the log(odds) for this class is defined as $\log\left(\frac{p_k}{1 - p_k}\right)$.
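The log(odds) transform and its inverse can be written in two short functions. This is a sketch with illustrative names, using the standard definition log(p / (1 - p)) for a probability p strictly between 0 and 1.

```python
from math import exp, log


def log_odds(p: float) -> float:
    """Log(odds) of a probability p with 0 < p < 1."""
    return log(p / (1 - p))


def sigmoid(z: float) -> float:
    """Inverse of log_odds: maps a real-valued score back to a probability."""
    return 1 / (1 + exp(-z))
```

A probability of 0.5 corresponds to a log(odds) of zero, and applying `sigmoid` to a fitted linear score recovers the class probability.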

2.2.7 Support Vector Machines (SVM)

SVM [Cortes1995Support] can be used for both classification and regression problems. Like Linear Regression, SVM tries to find a hyperplane, but one that separates the classes with maximum margin. The learning problem is converted into a quadratic optimization problem subject to linear constraints. Using the tools of quadratic programming, a few input vectors (called support vectors) are selected that can be used for prediction. Input vectors that are not linearly separable can be handled with kernel techniques.

2.2.8 Naive Bayes

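Naive Bayes [Rish01anempirical] can be used for both classification and regression problems. The Naive Bayes algorithm is based on Bayes' theorem with the assumption that the features are conditionally independent given the outcome. Suppose $X_1, \dots, X_n$ are real-valued attributes, and $Y$ is the set of all possible outcomes. Now according to Bayes' theorem, for $y \in Y$,

$$P(y \mid X_1, \dots, X_n) = \frac{P(y)\, P(X_1, \dots, X_n \mid y)}{\sum_{y' \in Y} P(y')\, P(X_1, \dots, X_n \mid y')}.$$

If we assume that $X_1, \dots, X_n$ are conditionally independent for a given outcome $y \in Y$, then the above equation can be written as

$$P(y \mid X_1, \dots, X_n) = \frac{P(y) \prod_{j=1}^{n} P(X_j \mid y)}{\sum_{y' \in Y} P(y') \prod_{j=1}^{n} P(X_j \mid y')}.$$

The above equation is used for the classification problem. Similarly, we can define Naive Bayes for regression problems, where the sum in the above equation will be replaced by integration.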
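For real-valued attributes, one common way to model the per-class likelihood of each attribute is a Gaussian density. The sketch below assumes that choice, which the source does not state. The function names and the small variance floor are also assumptions.

```python
from collections import defaultdict
from math import exp, log, pi


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate class priors and per-class feature means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # var_floor avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in c) / len(c) + var_floor
                     for c, m in zip(cols, means)]
        model[label] = (len(rows) / len(X), means, variances)
    return model


def posterior(model, x):
    """Posterior probability of each class for one feature vector x."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        total = log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * log(2 * pi * s2) - (v - m) ** 2 / (2 * s2)
        log_joint[label] = total
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint.values())
    unnorm = {k: exp(v - top) for k, v in log_joint.items()}
    norm = sum(unnorm.values())
    return {k: v / norm for k, v in unnorm.items()}
```

The normalization in `posterior` is the sum over outcomes that appears in the denominator of Bayes' theorem.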

2.2.9 Metrics

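To evaluate the performance of regression models for MSWS, we use the root mean square error (RMSE) and the coefficient of determination ($R^2$).

  • RMSE: If there are $n$ sample points with $y_i$ as the actual value and $\hat{y}_i$ as the predicted value evaluated from the model, then RMSE is defined as

    $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$$

    RMSE is always non-negative and should be close to $0$.

  • $R^2$: The coefficient of determination ($R^2$) is defined as

    $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}},$$

    where the total sum of squares, $SS_{tot}$, is defined as $\sum_{i=1}^{n} (y_i - \bar{y})^2$ and the residual sum of squares, $SS_{res}$, is defined as $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Here, $\bar{y}$ is the mean of the data, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$.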

The confusion matrix is used to determine the performance of the classification models on the test data. Multi-class classification accuracy has been measured using the confusion matrix [confusionmatrix]. Accuracy is the ratio of correctly predicted samples to all samples.
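The three metrics can be computed as in the following sketch. It uses the standard definitions of RMSE, the coefficient of determination and accuracy; the function names are illustrative, and the confusion matrix is assumed to be a square list of lists with actual classes as rows and predicted classes as columns.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot


def accuracy(confusion):
    """Share of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total
```

A perfect regression model gives an RMSE of 0 and a coefficient of determination of 1, and a perfect classifier has all of its counts on the diagonal of the confusion matrix.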

3 Results and Discussions

3.1 Correlation analysis

Figure 2: Correlation between variables.

The correlation matrix of all variables is given in Figure 2. The grade is weakly correlated with all the variables except ECP and PD. Also, the correlation of grade with ECP is negative, suggesting that if central pressure is low, the intensity of the cyclone is high. The MSWS shares a similar correlation with ECP as the grade. This is not surprising as grade is directly evaluated from MSWS; see Table 1. PD has a strong positive correlation with MSWS. A linear regression suggests the following relationship between MSWS and PD

in the North Indian Ocean. Notice that in [Rosendal982Relationship], a similar relationship between MSWS and PD () was reported for tropical cyclones in Central North Pacific Ocean.

3.2 Model selection and validation

We use 10-fold cross-validation for each of the models. In each fold, we split the data into training and validation sets in the ratio of 4:1. Then, each ML algorithm is applied to the training set to train the model. At every step, the performance measures (RMSE, $R^2$, or accuracy) of the model are recorded, and the average of each of these measures is reported in Tables 2(a) and 2(b).
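The fold bookkeeping of k-fold cross-validation can be sketched as below. This is plain k-fold partitioning without shuffling, in which each fold holds out roughly one k-th of the samples; it does not reproduce the 4:1 split mentioned above, and the function name is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size
```

A model would be trained on `train` and scored on `validation` in every fold, and the per-fold scores averaged to give the reported figure.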

Model                         RMSE    $R^2$
XGBoost                       2.30    0.99
Gradient Boosting             2.80    0.97
Decision Tree                 3.91    0.94
Random Forest                 3.12    0.96
Linear Regression             5.07    0.92
SVM (Kernel: Polynomial)
Naive Bayes                   3.38    0.97
(a) Regression Analysis on MSWS.

Model                                   Accuracy (%)
XGBoost                                 87.15
GBM                                     85.73
Decision Tree                           87.91
Random Forest                           85.95
Naive Bayes                             86.39
Logistic                                71.28
SVM (Kernel: Linear)
SVM (Kernel: Polynomial, degree 4)
(b) Classification (Multi-class) Analysis on Cyclone grade.

It is evident from Table 2(a) that XGBoost outperforms the other models with an RMSE of 2.3 and an $R^2$ of 0.99. Notice from Table 1 that the range of values of MSWS for a particular grade is always greater than or equal to 5, and since XGBoost predicts MSWS with an RMSE of 2.3, we expect that XGBoost will also predict grade with very high accuracy. That is indeed the case: from Table 2(b), XGBoost has an accuracy of 87.15% in predicting the grade. However, the Decision Tree with Entropy of depth 4 outperforms XGBoost in predicting the grade, with an accuracy of 87.91%.

Moreover, if we fix the classification model for the grade to be the Decision Tree with Entropy of depth 4, Table 3 reports the accuracy in predicting each category of the grade. The model predicts the top three high-intensity categories (SCS, VSCS, and SS) of grade with an average accuracy of 98.84%.

Category    Accuracy (%)
LP          98.33
D           77.92
DD          78.37
CS          88.72
SCS         100
VSCS        99.51
SS          97
Table 3: Classification accuracy of different Cyclone Grade.
Figure 3: XGBoost tree for MSWS.
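The 98.84% figure quoted for the three highest categories is the plain average of the SCS, VSCS and SS entries of Table 3, as the short check below shows. The variable names are illustrative.

```python
# Per-category accuracy (%) as listed in Table 3.
per_grade_accuracy = {
    "LP": 98.33, "D": 77.92, "DD": 78.37, "CS": 88.72,
    "SCS": 100.0, "VSCS": 99.51, "SS": 97.0,
}

top_three = ["SCS", "VSCS", "SS"]
average = sum(per_grade_accuracy[g] for g in top_three) / len(top_three)
print(round(average, 2))  # 98.84
```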

3.3 Testing on Vayu and Fani

Figure 4: Scatter plot of actual and model-predicted MSWS for Fani and Vayu.
Figure 5: Actual and model predicted grade along track of Fani.
Figure 6: Actual and model predicted grade along track of Vayu.

We test our model on two recent tropical cyclones, Vayu and Fani. Vayu was a grade 7 tropical cyclone which hit the Indian west coast in June 2019. Around 6.6 million people were affected in northwestern India by the cyclone [VAYU]. Fani was also a grade 7 tropical cyclone that hit the Indian state of Odisha in April-May 2019. Due to Fani, India and Bangladesh suffered heavy damage. At least 89 people were reported dead, and the damage was estimated at around US$8.1 billion [FANI].

We checked the performance of the best model to predict MSWS, XGBoost, on Vayu and Fani. The RMSE is 2.2 and 3.4, while $R^2$ is 0.99 and 0.99 for Vayu and Fani, respectively. Figure 4 depicts the actual values of MSWS and the values predicted by the XGBoost model during the course of Vayu and Fani.

The best model to predict grade, the Decision Tree with Entropy of depth 4, predicts the different grades during the course of Vayu and Fani with an accuracy of 93.22% and 95.23%, respectively. The actual and predicted grades along the tracks of Fani and Vayu are shown in Figures 5 and 6, respectively.

4 Conclusion

Estimating the intensity of tropical cyclones on a real-time basis is a problem worth studying, considering the human life and economic loss involved. In this study, we explored various machine learning techniques and reported their performance in estimating the maximum sustained surface wind speed and the intensity of tropical cyclones. Our research finds that the ML models XGBoost and Decision Tree can be used for the estimation of MSWS and intensity, respectively, with excellent performance over the North Indian Ocean.


The authors are thankful to the India Meteorological Department (IMD) for providing the data archives.

Conflict of Interest

All the authors declare that they have no conflict of interest.