Analyzing hurricane tracks - Part 2/3¶
This is the second part of a three-part set of notebooks that process and analyze historic hurricane tracks. In the previous notebook we saw
- downloading historic hurricane datasets using Python
- cleaning and merging hurricane observations using Dask
- aggregating point observations into hurricane tracks using ArcGIS GeoAnalytics Server
In this notebook you will analyze the aggregated tracks to answer important questions about the prevalence of hurricanes, their seasonality, their density, and where they make landfall, and investigate the communities that are most affected.
Import the libraries necessary for this notebook.
# import ArcGIS Libraries
from arcgis.gis import GIS
from arcgis.geometry import filters
from arcgis.geocoding import geocode
from arcgis.features.manage_data import overlay_layers
from arcgis.geoenrichment import enrich
# import Pandas for data exploration
import pandas as pd
import numpy as np
from scipy import stats
# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# import display tools
from pprint import pprint
from IPython.display import display
# import system libs
from sys import getsizeof
gis = GIS('home')
import warnings
warnings.filterwarnings('ignore')
Access aggregated hurricane data¶
Below, we access the tracks aggregated using GeoAnalytics in the previous notebook.
hurricane_tracks_item = gis.content.search('title:hurricane_tracks_aggregated_ga')[0]
hurricane_fl = hurricane_tracks_item.layers[0]
The GeoAnalytics step calculated summary statistics for all numeric fields. However, only a few of the columns are of interest to us.
pprint([f['name'] for f in hurricane_fl.properties.fields], compact=True, width=80)
['objectid', 'serial_num', 'count', 'count_col_1', 'sum_col_1', 'min_col_1', 'max_col_1', 'mean_col_1', 'range_col_1', 'sd_col_1', 'var_col_1', 'count_season', 'sum_season', 'min_season', 'max_season', 'mean_season', 'range_season', 'sd_season', 'var_season', 'count_num', 'sum_num', 'min_num', 'max_num', 'mean_num', 'range_num', 'sd_num', 'var_num', 'count_basin', 'any_basin', 'count_sub_basin', 'any_sub_basin', 'count_name', 'any_name', 'count_iso_time', 'any_iso_time', 'count_nature', 'any_nature', 'count_center', 'any_center', 'count_track_type', 'any_track_type', 'count_current_basin', 'any_current_basin', 'count_latitude_merged', 'sum_latitude_merged', 'min_latitude_merged', 'max_latitude_merged', 'mean_latitude_merged', 'range_latitude_merged', 'sd_latitude_merged', 'var_latitude_merged', 'count_longitude_merged', 'sum_longitude_merged', 'min_longitude_merged', 'max_longitude_merged', 'mean_longitude_merged', 'range_longitude_merged', 'sd_longitude_merged', 'var_longitude_merged', 'count_wind_merged', 'sum_wind_merged', 'min_wind_merged', 'max_wind_merged', 'mean_wind_merged', 'range_wind_merged', 'sd_wind_merged', 'var_wind_merged', 'count_pressure_merged', 'sum_pressure_merged', 'min_pressure_merged', 'max_pressure_merged', 'mean_pressure_merged', 'range_pressure_merged', 'sd_pressure_merged', 'var_pressure_merged', 'count_grade_merged', 'sum_grade_merged', 'min_grade_merged', 'max_grade_merged', 'mean_grade_merged', 'range_grade_merged', 'sd_grade_merged', 'var_grade_merged', 'count_eye_dia_merged', 'sum_eye_dia_merged', 'min_eye_dia_merged', 'max_eye_dia_merged', 'mean_eye_dia_merged', 'range_eye_dia_merged', 'sd_eye_dia_merged', 'var_eye_dia_merged', 'track_duration', 'end_datetime', 'start_datetime']
Below we select the following fields for the rest of this analysis.
fields_to_query = ['objectid', 'count', 'min_season', 'any_basin', 'any_sub_basin',
'any_name', 'mean_latitude_merged', 'mean_longitude_merged',
'max_wind_merged', 'range_wind_merged', 'min_pressure_merged',
'range_pressure_merged', 'max_eye_dia_merged', 'track_duration',
'end_datetime', 'start_datetime']
Query hurricane tracks into a Spatially Enabled DataFrame¶
%%time
all_hurricanes_df = hurricane_fl.query(out_fields=','.join(fields_to_query), as_df=True)
CPU times: user 1.12 s, sys: 318 ms, total: 1.43 s Wall time: 4.5 s
all_hurricanes_df.shape
(12362, 17)
There are 12,362 hurricanes identified by the GeoAnalytics aggregate tracks tool. To get an idea of this aggregated dataset, call the head() method.
all_hurricanes_df.head()
 | SHAPE | any_basin | any_name | any_sub_basin | count | end_datetime | max_eye_dia_merged | max_wind_merged | mean_latitude_merged | mean_longitude_merged | min_pressure_merged | min_season | objectid | range_pressure_merged | range_wind_merged | start_datetime | track_duration
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | {"paths": [[[59.60000000000002, -17.6000000000... | SI | NOT NAMED | MM | 7.0 | 1854-02-10 18:00:00 | NaN | NaN | -19.318571 | 60.639286 | NaN | 1854.0 | 1 | NaN | NaN | 1854-02-08 06:00:00 | 1.296000e+08 |
1 | {"paths": [[[-23.5, 12.5], [-24.19999999999999... | NA | NOT NAMED | NA | 9.0 | 1859-08-26 12:00:00 | NaN | 45.0 | 14.000000 | -26.222222 | NaN | 1859.0 | 2 | NaN | 10.0 | 1859-08-24 12:00:00 | 1.728000e+08 |
2 | {"paths": [[[-23.19999999999999, 12.1000000000... | NA | UNNAMED | NA | 50.0 | 1853-09-12 18:00:00 | NaN | 130.0 | 26.982000 | -51.776000 | 924.0 | 1853.0 | 3 | 53.0 | 90.0 | 1853-08-30 00:00:00 | 1.058400e+09 |
3 | {"paths": [[[59.80000000000001, -15.5], [59.49... | SI | XXXX856017 | MM | 13.0 | 1856-04-05 18:00:00 | NaN | NaN | -20.185385 | 59.573077 | NaN | 1856.0 | 4 | NaN | NaN | 1856-04-02 18:00:00 | 2.592000e+08 |
4 | {"paths": [[[99.60000000000002, -11.5], [98.30... | SI | NOT NAMED | WA | 13.0 | 1861-03-15 18:00:00 | NaN | NaN | -12.940769 | 94.183846 | NaN | 1861.0 | 5 | NaN | NaN | 1861-03-12 18:00:00 | 2.592000e+08 |
To better analyze this dataset, the date columns need to be changed to a format that Pandas understands better. This is accomplished by calling the to_datetime() method and passing the appropriate time columns.
all_hurricanes_df['start_datetime'] = pd.to_datetime(all_hurricanes_df['start_datetime'])
all_hurricanes_df['end_datetime'] = pd.to_datetime(all_hurricanes_df['end_datetime'])
all_hurricanes_df.index = all_hurricanes_df['start_datetime']
all_hurricanes_df.head()
start_datetime | SHAPE | any_basin | any_name | any_sub_basin | count | end_datetime | max_eye_dia_merged | max_wind_merged | mean_latitude_merged | mean_longitude_merged | min_pressure_merged | min_season | objectid | range_pressure_merged | range_wind_merged | start_datetime | track_duration
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1854-02-08 06:00:00 | {"paths": [[[59.60000000000002, -17.6000000000... | SI | NOT NAMED | MM | 7.0 | 1854-02-10 18:00:00 | NaN | NaN | -19.318571 | 60.639286 | NaN | 1854.0 | 1 | NaN | NaN | 1854-02-08 06:00:00 | 1.296000e+08 |
1859-08-24 12:00:00 | {"paths": [[[-23.5, 12.5], [-24.19999999999999... | NA | NOT NAMED | NA | 9.0 | 1859-08-26 12:00:00 | NaN | 45.0 | 14.000000 | -26.222222 | NaN | 1859.0 | 2 | NaN | 10.0 | 1859-08-24 12:00:00 | 1.728000e+08 |
1853-08-30 00:00:00 | {"paths": [[[-23.19999999999999, 12.1000000000... | NA | UNNAMED | NA | 50.0 | 1853-09-12 18:00:00 | NaN | 130.0 | 26.982000 | -51.776000 | 924.0 | 1853.0 | 3 | 53.0 | 90.0 | 1853-08-30 00:00:00 | 1.058400e+09 |
1856-04-02 18:00:00 | {"paths": [[[59.80000000000001, -15.5], [59.49... | SI | XXXX856017 | MM | 13.0 | 1856-04-05 18:00:00 | NaN | NaN | -20.185385 | 59.573077 | NaN | 1856.0 | 4 | NaN | NaN | 1856-04-02 18:00:00 | 2.592000e+08 |
1861-03-12 18:00:00 | {"paths": [[[99.60000000000002, -11.5], [98.30... | SI | NOT NAMED | WA | 13.0 | 1861-03-15 18:00:00 | NaN | NaN | -12.940769 | 94.183846 | NaN | 1861.0 | 5 | NaN | NaN | 1861-03-12 18:00:00 | 2.592000e+08 |
The track duration column needs to be converted to units (hours, days) that are meaningful for analysis.
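# track_duration is in milliseconds; 3,600,000 ms = 1 hour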
all_hurricanes_df['track_duration_hrs'] = all_hurricanes_df['track_duration'] / 3600000
all_hurricanes_df['track_duration_days'] = all_hurricanes_df['track_duration'] / (3600000*24)
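As a quick sanity check on the converted values, we can list the longest-lived tracks. This is a sketch using only the columns queried above, not a step from the original workflow.
# list the five longest-lived hurricane tracks using the new duration column
all_hurricanes_df.nlargest(5, 'track_duration_days')[['any_name', 'min_season', 'track_duration_days']]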
Exploratory data analysis¶
In this section we perform exploratory analysis of the dataset and answer some interesting questions.
map1 = gis.map('USA')
map1
all_hurricanes_df.sample(n=500, random_state=2).spatial.plot(map1,
renderer_type='u',
col='any_basin',
cmap='prism')
True
The map above draws a set of 500 hurricanes chosen at random. You can visualize the Spatially Enabled DataFrame object with different types of renderers; in the example above, a unique value renderer is applied to the basin column. You can switch the map to 3D mode and view the same data on a globe.
map2 = gis.map()
map2.mode= '3D'
map2
all_hurricanes_df.sample(n=500, random_state=2).spatial.plot(map2,
renderer_type='u',
col='any_basin',
cmap='prism')
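Both maps above use a unique value renderer. As a minimal sketch of an alternative, a class-breaks renderer (renderer_type='c') can symbolize the same sample by a numeric column such as max_wind_merged; the map3 name and the class_count value below are illustrative assumptions, not values from the original notebook.
map3 = gis.map('USA')
map3
# class-breaks renderer on maximum wind speed (sketch)
all_hurricanes_df.sample(n=500, random_state=2).spatial.plot(map3,
                                                             renderer_type='c',
                                                             col='max_wind_merged',
                                                             class_count=5,
                                                             cmap='YlOrRd')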
Does the number of hurricanes increase with time?¶
To understand if the number of hurricanes has increased over time, we will plot a histogram of the min_season column.
ax = sns.distplot(all_hurricanes_df['min_season'], kde=False, bins=50)
ax.set_title('Number of hurricanes recorded over time')
Text(0.5,1,'Number of hurricanes recorded over time')
The number of hurricanes recorded increases steadily until 1970. This could be due to advances in geospatial technologies allowing scientists to better monitor hurricanes. However, after 1970 we notice a reduction in the number of hurricanes. This is in line with what scientists observe and predict.
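To quantify this trend, we can tally tracks per decade. This is a sketch that derives a decade value from min_season, not a step from the original notebook:
# count hurricane tracks per decade by flooring min_season to the nearest 10
decade = (all_hurricanes_df['min_season'].dropna() // 10 * 10).astype(int)
decade.value_counts().sort_index()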
How many hurricanes occur per basin and sub basin?¶
Climate scientists have organized global hurricanes into 7 basins and a number of sub basins. The snippet below groups the data by basin and sub basin, counts the occurrences, and plots the frequencies as bar charts.
fig1, ax1 = plt.subplots(1,2, figsize=(12,5))
basin_ax = all_hurricanes_df['any_basin'].value_counts().plot(kind='bar', ax=ax1[0])
basin_ax.set_title('Number of hurricanes per basin')
basin_ax.set_xticklabels(['Western Pacific', 'South Indian', 'North Atlantic',
'Eastern Pacific', 'North Indian', 'Southern Pacific',
'South Atlantic'])
sub_basin_ax = all_hurricanes_df['any_sub_basin'].value_counts().plot(kind='bar', ax=ax1[1])
sub_basin_ax.set_title('Number of hurricanes per sub basin')
sub_basin_ax.set_xticklabels(['MM','North Atlantic','Bay of Bengal','Western Australia',
'Eastern Australia', 'Caribbean Sea', 'Gulf of Mexico',
'Arabian Sea', 'Central Pacific'])
sub_basin_ax.tick_params()
Thus, most hurricanes occur in the Western Pacific basin. This is the region east of China, the Philippines, and the rest of Southeast Asia. It is followed by the South Indian basin, which spans from west of Australia to east of Southern Africa. The North Atlantic basin, which is the source of hurricanes in the continental United States, ranks as the third busiest hurricane basin.
Are certain hurricane names more popular?¶
Pandas provides a handy API called value_counts() to count unique occurrences. We use it below to count the number of times each hurricane name has been used, then print the 25 most frequently used names.
# Get the number of occurrences of top 25 hurricane names
all_hurricanes_df['any_name'].value_counts()[:25]
NOT NAMED          4099
UNNAMED            1408
06B                  31
05B                  30
04B                  30
09B                  30
07B                  29
08B                  29
10B                  29
03B                  28
01B                  27
12B                  26
11B                  23
13B                  23
02B                  22
14B                  17
SUBTROP:UNNAMED      16
IRMA                 15
FLORENCE             15
02A                  14
JUNE                 13
ALICE                13
OLGA                 13
SUSAN                13
FREDA                13
Name: any_name, dtype: int64
Names like FLORENCE, IRMA, and OLGA appear to be more popular. Interestingly, all are female names. We can take this further and explore during which time periods the name FLORENCE has been used.
all_hurricanes_df[all_hurricanes_df['any_name']=='FLORENCE'].index
DatetimeIndex(['1953-09-23 12:00:00', '1954-09-10 12:00:00', '1963-07-14 12:00:00', '1967-01-03 06:00:00', '1964-09-05 18:00:00', '1965-09-08 00:00:00', '1973-07-25 00:00:00', '1960-09-17 06:00:00', '1994-11-02 00:00:00', '1969-09-02 00:00:00', '2012-08-03 06:00:00', '1977-09-20 12:00:00', '2000-09-10 18:00:00', '1988-09-07 06:00:00', '2006-09-03 18:00:00'], dtype='datetime64[ns]', name='start_datetime', freq=None)
The name FLORENCE has been used consistently since the 1950s, reaching a peak in popularity during the 60s.
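A small sketch (not part of the original notebook) makes this pattern explicit by tallying uses of the name per decade, using the datetime index set earlier:
# tally occurrences of the name FLORENCE per decade
florence = all_hurricanes_df[all_hurricanes_df['any_name'] == 'FLORENCE']
(florence.index.year // 10 * 10).value_counts().sort_index()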
Is there a seasonality in the occurrence of hurricanes?¶
Hurricanes happen when water temperatures (Sea Surface Temperature, SST) are warm. Solar incidence is one of the key factors affecting SST, and it typically peaks during the summer months. However, summer falls in different months in the northern and southern hemispheres. To visualize this seasonality, we need to group our data by month as well as basin. Thus, the snippet below creates a multilevel index grouper in Pandas.
# Create a grouper object
grouper = all_hurricanes_df.start_datetime.dt.month_name()
# use grouper along with basin name to create a multilevel groupby object
hurr_by_basin = all_hurricanes_df.groupby([grouper,'any_basin'], as_index=True)
hurr_by_basin_month = hurr_by_basin.count()[['count', 'min_pressure_merged']]
hurr_by_basin_month.head()
start_datetime | any_basin | count | min_pressure_merged
---|---|---|---
April | NA | 5 | 2
 | NI | 41 | 5
 | SI | 242 | 85
 | SP | 97 | 74
 | WP | 83 | 56
Now we turn the index into columns for further processing.
# turn index into columns
hurr_by_basin_month.reset_index(inplace=True)
hurr_by_basin_month.drop('min_pressure_merged', axis=1, inplace=True)
hurr_by_basin_month.columns = ['month', 'basin', 'count']
hurr_by_basin_month.head()
 | month | basin | count
---|---|---|---
0 | April | NA | 5 |
1 | April | NI | 41 |
2 | April | SI | 242 |
3 | April | SP | 97 |
4 | April | WP | 83 |
Finally, we plot the counts per month and basin, passing an explicit month order so the months are sorted chronologically rather than alphabetically.
fig, ax = plt.subplots(1,1, figsize=(15,7))
month_order = ['January','February', 'March','April','May','June',
'July','August','September','October','November','December']
sns.barplot(x='month', y='count', hue='basin', data=hurr_by_basin_month, ax=ax,
order=month_order)
<matplotlib.axes._subplots.AxesSubplot at 0x1a2705d908>
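An equivalent way to achieve chronological ordering, shown here as a sketch, is to encode the month column as an ordered categorical so that any subsequent sort or plot respects the calendar order:
# alternative: make month an ordered categorical using the month_order list above
hurr_by_basin_month['month'] = pd.Categorical(hurr_by_basin_month['month'],
                                              categories=month_order,
                                              ordered=True)
hurr_by_basin_month.sort_values('month').head()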