Analyzing hurricane tracks - Part 3/3Ā¶
This is the third part to a three part set of notebooks that process and analyze historic hurricane tracks. In the previous notebooks we saw
Part 1
- downloading historic hurricane datasets using Python
- cleaning and merging hurricane observations using DASK
- aggregating point observations into hurricane tracks using ArcGIS GeoAnalytics server
Part 2
- Visualize the locations of hurricane tracks
- Different basins and the number of hurricanes per basin
- Number of hurricanes over time
- Seasonality in occurrence of hurricanes
- Places where hurricanes make landfall and the people affected
In this notebook you will analyze the aggregated tracks to answer important questions about hurricane severity and how they correlate over time.
Import the libraries necessary for this notebook.
# import ArcGIS Libraries
from arcgis.gis import GIS
from arcgis.geometry import filters
from arcgis.geocoding import geocode
from arcgis.features.manage_data import overlay_layers
from arcgis.geoenrichment import enrich
# import Pandas for data exploration
import pandas as pd
import numpy as np
from scipy import stats
# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# import display tools
from pprint import pprint
from IPython.display import display
# import system libs
from sys import getsizeof
gis = GIS('home')
import warnings
warnings.filterwarnings('ignore')
Access aggregated hurricane dataĀ¶
Below, we access the tracks aggregated using GeoAnalytics in the previous notebook.
hurricane_tracks_item = gis.content.search('title:hurricane_tracks_aggregated_ga')[0]
hurricane_fl = hurricane_tracks_item.layers[0]
The GeoAnalytics step calculated summary statistics of all numeric fields. However only a few of the columns are of interest to us.
pprint([f['name'] for f in hurricane_fl.properties.fields], compact=True, width=80)
['objectid', 'serial_num', 'count', 'count_col_1', 'sum_col_1', 'min_col_1', 'max_col_1', 'mean_col_1', 'range_col_1', 'sd_col_1', 'var_col_1', 'count_season', 'sum_season', 'min_season', 'max_season', 'mean_season', 'range_season', 'sd_season', 'var_season', 'count_num', 'sum_num', 'min_num', 'max_num', 'mean_num', 'range_num', 'sd_num', 'var_num', 'count_basin', 'any_basin', 'count_sub_basin', 'any_sub_basin', 'count_name', 'any_name', 'count_iso_time', 'any_iso_time', 'count_nature', 'any_nature', 'count_center', 'any_center', 'count_track_type', 'any_track_type', 'count_current_basin', 'any_current_basin', 'count_latitude_merged', 'sum_latitude_merged', 'min_latitude_merged', 'max_latitude_merged', 'mean_latitude_merged', 'range_latitude_merged', 'sd_latitude_merged', 'var_latitude_merged', 'count_longitude_merged', 'sum_longitude_merged', 'min_longitude_merged', 'max_longitude_merged', 'mean_longitude_merged', 'range_longitude_merged', 'sd_longitude_merged', 'var_longitude_merged', 'count_wind_merged', 'sum_wind_merged', 'min_wind_merged', 'max_wind_merged', 'mean_wind_merged', 'range_wind_merged', 'sd_wind_merged', 'var_wind_merged', 'count_pressure_merged', 'sum_pressure_merged', 'min_pressure_merged', 'max_pressure_merged', 'mean_pressure_merged', 'range_pressure_merged', 'sd_pressure_merged', 'var_pressure_merged', 'count_grade_merged', 'sum_grade_merged', 'min_grade_merged', 'max_grade_merged', 'mean_grade_merged', 'range_grade_merged', 'sd_grade_merged', 'var_grade_merged', 'count_eye_dia_merged', 'sum_eye_dia_merged', 'min_eye_dia_merged', 'max_eye_dia_merged', 'mean_eye_dia_merged', 'range_eye_dia_merged', 'sd_eye_dia_merged', 'var_eye_dia_merged', 'track_duration', 'end_datetime', 'start_datetime']
Below we select the following fields for the rest of this analysis.
fields_to_query = ['objectid', 'count', 'min_season', 'any_basin', 'any_sub_basin',
'any_name', 'mean_latitude_merged', 'mean_longitude_merged',
'max_wind_merged', 'range_wind_merged', 'min_pressure_merged',
'range_pressure_merged', 'max_eye_dia_merged', 'track_duration',
'end_datetime', 'start_datetime']
Query hurricane tracks into a Spatially enabled DataFrame
Ā¶
%%time
all_hurricanes_df = hurricane_fl.query(out_fields=','.join(fields_to_query), as_df=True)
CPU times: user 1.12 s, sys: 318 ms, total: 1.43 s Wall time: 4.5 s
all_hurricanes_df.shape
(12362, 17)
There are 12,362
hurricanes identified by GeoAnalytics aggregate tracks tool. To get an idea about this aggregated dataset, call the head()
method.
all_hurricanes_df.head()
SHAPE | any_basin | any_name | any_sub_basin | count | end_datetime | max_eye_dia_merged | max_wind_merged | mean_latitude_merged | mean_longitude_merged | min_pressure_merged | min_season | objectid | range_pressure_merged | range_wind_merged | start_datetime | track_duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {"paths": [[[59.60000000000002, -17.6000000000... | SI | NOT NAMED | MM | 7.0 | 1854-02-10 18:00:00 | NaN | NaN | -19.318571 | 60.639286 | NaN | 1854.0 | 1 | NaN | NaN | 1854-02-08 06:00:00 | 1.296000e+08 |
1 | {"paths": [[[-23.5, 12.5], [-24.19999999999999... | NA | NOT NAMED | NA | 9.0 | 1859-08-26 12:00:00 | NaN | 45.0 | 14.000000 | -26.222222 | NaN | 1859.0 | 2 | NaN | 10.0 | 1859-08-24 12:00:00 | 1.728000e+08 |
2 | {"paths": [[[-23.19999999999999, 12.1000000000... | NA | UNNAMED | NA | 50.0 | 1853-09-12 18:00:00 | NaN | 130.0 | 26.982000 | -51.776000 | 924.0 | 1853.0 | 3 | 53.0 | 90.0 | 1853-08-30 00:00:00 | 1.058400e+09 |
3 | {"paths": [[[59.80000000000001, -15.5], [59.49... | SI | XXXX856017 | MM | 13.0 | 1856-04-05 18:00:00 | NaN | NaN | -20.185385 | 59.573077 | NaN | 1856.0 | 4 | NaN | NaN | 1856-04-02 18:00:00 | 2.592000e+08 |
4 | {"paths": [[[99.60000000000002, -11.5], [98.30... | SI | NOT NAMED | WA | 13.0 | 1861-03-15 18:00:00 | NaN | NaN | -12.940769 | 94.183846 | NaN | 1861.0 | 5 | NaN | NaN | 1861-03-12 18:00:00 | 2.592000e+08 |
To better analyze this data set, the date columns need to be changed to a format that Pandas understands better. This is accomplished by calling to_datetime()
method and passing the appropriate time columns.
all_hurricanes_df['start_datetime'] = pd.to_datetime(all_hurricanes_df['start_datetime'])
all_hurricanes_df['end_datetime'] = pd.to_datetime(all_hurricanes_df['end_datetime'])
all_hurricanes_df.index = all_hurricanes_df['start_datetime']
all_hurricanes_df.head()
SHAPE | any_basin | any_name | any_sub_basin | count | end_datetime | max_eye_dia_merged | max_wind_merged | mean_latitude_merged | mean_longitude_merged | min_pressure_merged | min_season | objectid | range_pressure_merged | range_wind_merged | start_datetime | track_duration | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
start_datetime | |||||||||||||||||
1854-02-08 06:00:00 | {"paths": [[[59.60000000000002, -17.6000000000... | SI | NOT NAMED | MM | 7.0 | 1854-02-10 18:00:00 | NaN | NaN | -19.318571 | 60.639286 | NaN | 1854.0 | 1 | NaN | NaN | 1854-02-08 06:00:00 | 1.296000e+08 |
1859-08-24 12:00:00 | {"paths": [[[-23.5, 12.5], [-24.19999999999999... | NA | NOT NAMED | NA | 9.0 | 1859-08-26 12:00:00 | NaN | 45.0 | 14.000000 | -26.222222 | NaN | 1859.0 | 2 | NaN | 10.0 | 1859-08-24 12:00:00 | 1.728000e+08 |
1853-08-30 00:00:00 | {"paths": [[[-23.19999999999999, 12.1000000000... | NA | UNNAMED | NA | 50.0 | 1853-09-12 18:00:00 | NaN | 130.0 | 26.982000 | -51.776000 | 924.0 | 1853.0 | 3 | 53.0 | 90.0 | 1853-08-30 00:00:00 | 1.058400e+09 |
1856-04-02 18:00:00 | {"paths": [[[59.80000000000001, -15.5], [59.49... | SI | XXXX856017 | MM | 13.0 | 1856-04-05 18:00:00 | NaN | NaN | -20.185385 | 59.573077 | NaN | 1856.0 | 4 | NaN | NaN | 1856-04-02 18:00:00 | 2.592000e+08 |
1861-03-12 18:00:00 | {"paths": [[[99.60000000000002, -11.5], [98.30... | SI | NOT NAMED | WA | 13.0 | 1861-03-15 18:00:00 | NaN | NaN | -12.940769 | 94.183846 | NaN | 1861.0 | 5 | NaN | NaN | 1861-03-12 18:00:00 | 2.592000e+08 |
The track duration and length columns need to be projected to units (days, hours, miles) that are meaningful for analysis.
all_hurricanes_df['track_duration_hrs'] = all_hurricanes_df['track_duration'] / 3600000
all_hurricanes_df['track_duration_days'] = all_hurricanes_df['track_duration'] / (3600000*24)
Query landfall tracks layer into a Spatially Enabled DataFrame
Ā¶
We query the landfall tracks layer created in the pervious notebook into a DataFrame.
inland_tracks = gis.content.search('hurricane_landfall_tracks')[0]
fields_to_query = ['min_season', 'any_basin','any_name', 'max_wind_merged',
'min_pressure_merged', 'track_duration','end_datetime',
'start_datetime', 'analysislength']
landfall_tracks_fl = inland_tracks.layers[0]
landfall_tracks_df = landfall_tracks_fl.query(out_fields=fields_to_query).df
landfall_tracks_df.head(3)
analysislength | any_basin | any_name | end_datetime | max_wind_merged | min_pressure_merged | min_season | objectid | start_datetime | track_duration | SHAPE | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.376642 | NA | NOT NAMED | -3663424800000 | 95.0 | 965.0 | 1853.0 | 1 | -3664699200000 | 1.317600e+09 | {'paths': [[[-74.47272727299998, 24], [-74.463... |
1 | 117.097286 | NA | UNNAMED | -3645172800000 | 70.0 | NaN | 1854.0 | 2 | -3645475200000 | 2.160000e+08 | {'paths': [[[-99.13749999999999, 26.5699999999... |
2 | 256.909588 | NA | UNNAMED | -3645172800000 | 70.0 | NaN | 1854.0 | 3 | -3645475200000 | 2.160000e+08 | {'paths': [[[-102.21739130399999, 27.686956522... |
Manage missing sesnsor dataĀ¶
Before we can analyze if hurricanes intensify over time, we need to identify and account for missing values in our data. Sensor measurements such as wind speed, atmospheric pressure, eye diameter, generally suffer from missing values and outliers. The reconstruct tracks tool has identified 12,362
individual hurricanes that occurred during the past 169
years.
all_hurricanes_df.shape
(12362, 19)
Visualize missing recordsĀ¶
An easy way to visualize missing records is to hack the heatmap
of seaborn
library to display missing records. The snippet below shows missing records in yellow color.
missing_data_viz = all_hurricanes_df.replace(0,np.NaN)
missing_data_viz = missing_data_viz.replace(-9999.0,np.NaN)
missing_data_viz['min_pressure_merged'] = missing_data_viz['min_pressure_merged'].replace(100.0,np.NaN)
plt.figure(figsize=(10,10))
missing_data_ax = sns.heatmap(missing_data_viz[['max_wind_merged', 'min_pressure_merged',
'max_eye_dia_merged', 'track_duration']].isnull(),
cbar=False, cmap='viridis')
missing_data_ax.set_ylabel('Years')
missing_data_ax.set_title('Missing values (yellow) visualized as a heatmap')
Text(0.5,1,'Missing values (yellow) visualized as a heatmap')
All three observation columns - wind speed, atmospheric pressure and eye diameter, suffer from missing values. In general as technology improved over time, we were able to collect better data with fewer missing observations. In the sections below we attempt to fill these values using different techniques. We will compare how they fare and pick one of them for rest of the analysis.
Missing value imputationĀ¶
Technique 1: Drop missing values: An easy way to deal with missing values is to drop those record from analysis. If we were to do that, we lose over a third of the hurricanes.
hurricanes_nona = missing_data_viz.dropna(subset=['max_wind_merged','min_pressure_merged'])
hurricanes_nona.shape
(5857, 19)
Technique 2: Fill using median value: A common technique is to fill using median value (or a different measure of centrality). This technique computes the median of the entire column and applies that to all the missing values.
fill_values = {'max_wind_merged': missing_data_viz['max_wind_merged'].median(),
'min_pressure_merged': missing_data_viz['min_pressure_merged'].median(),
'track_duration_hrs': missing_data_viz['track_duration_hrs'].median()}
hurricanes_fillna = missing_data_viz.fillna(value=fill_values)
Technique 3: Fill by interpolating between existing values: A sophisticated approach is to interploate a missing value based on two of its closest observations.
hurricanes_ipl = missing_data_viz
hurricanes_ipl['max_wind_merged'] = hurricanes_ipl['max_wind_merged'].interpolate()
hurricanes_ipl['min_pressure_merged'] = hurricanes_ipl['min_pressure_merged'].interpolate()
hurricanes_ipl['track_duration_hrs'] = hurricanes_ipl['track_duration_hrs'].interpolate()
Visualize all 3 techniques
To compare how each of these techniques fared, we will plot the histogram of wind speed column after managing for missing values.
fig, ax = plt.subplots(1,3, sharex=True, figsize=(15,5))
fig.suptitle('Comparing effects of missing value imputations on Wind speed column',
fontsize=15)
hurricanes_nona['max_wind_merged'].plot(kind='hist', ax=ax[0], bins=35, title='Drop null values')
hurricanes_fillna['max_wind_merged'].plot(kind='hist', ax=ax[1], bins=35, title='Impute with median')
hurricanes_ipl['max_wind_merged'].plot(kind='hist', ax=ax[2], bins=35, title='Impute via interpolation')
for a in ax:
a.set_xlabel('Wind Speed')
Next, we will plot the histogram of atmospheric pressure column after managing for missing values.
fig, ax = plt.subplots(1,3, sharex=True, figsize=(15,5))
fig.suptitle('Comparing effects of missing value imputations on Pressure column',
fontsize=15)
hurricanes_nona['min_pressure_merged'].plot(kind='hist', ax=ax[0], title='Drop null values')
hurricanes_fillna['min_pressure_merged'].plot(kind='hist', ax=ax[1], title='Impute with median')
hurricanes_ipl['min_pressure_merged'].plot(kind='hist', ax=ax[2], title='Impute via interpolation')
for a in ax:
a.set_xlabel('Atmospheric Pressure')
Fill using interpolation preserves shape of the original distribution. So it will be used for further anlaysis.
Does intensity of hurricanes increase over time?Ā¶
This last part of this study analyzes if there a temporal trend in the intensity of hurricanes. A number of studies have concluded that anthropogenic influences in the form of global climate change make hurricanes worse and dangerous. We analyze if such patterns can be noticed from an empirical standpoint.
Does the number of hurricanes increase over time?Ā¶
ax = all_hurricanes_df['min_season'].hist(bins=50)
ax.set_title('Number of hurricanes per season')
Text(0.5,1,'Number of hurricanes per season')
From the previous notebook, we noticed the number of hurricanes recorded has been steadily increasing, partly due to advancements in technology. We notice a reduction in number of hurricanes after 1970s. Let us split this up by basin and observe the the trend is similar.
fgrid = sns.FacetGrid(data=all_hurricanes_df, col='any_basin', col_wrap=3,
sharex=False, sharey=False)
fgrid.map(plt.hist, 'min_season', bins=50)
fgrid.set_axis_labels(x_var='Seasons', y_var='Frequency of hurricanes')
<seaborn.axisgrid.FacetGrid at 0x1a35c9f908>
Plotting the frequency of hurricanes by basin shows a similar trend with the number of hurricanes reducing globally after 1970s. This is consistent with certain studies (1). However this is only one part of the story. Below, we continue to analyze if the nature of hurricanes itself is changing, while the total number may reduce.
Does hurricane wind speed increase over time?Ā¶
To understand if wind speed increases over time, we create a scatter plot of min_season
against the max_wind_merged
column. The seaborn
plotting library can enhance this plot with correlation coefficient and its level of signifance (p value).
jgrid = sns.jointplot(x='min_season', y='max_wind_merged', data=hurricanes_ipl,
kind='reg', joint_kws={'line_kws':{'color':'green'}}, height=7, space=0.5)
j = jgrid.annotate(stats.pearsonr)
j = jgrid.ax_joint.set_title('Does hurricane wind speed increase over time?')
From the plot above, we notice a small positive correlation. Wind speeds are observed to increase with time. The small p-value
suggests this correlation (albeit small) is statistically significant. The plot above considers hurricanes across all the basins and regresses that against time. To get a finer picture, we need to split the data by basins and observe the correlation.
# since there are not many hurricanes observed over South Atlantic basin (SA),
# we drop it from analysis
hurricanes_major_basins_df = hurricanes_ipl[hurricanes_ipl['any_basin'].isin(
['WP','SI','NA','EP','NI','SP'])]