Analyze New York City taxi data

**Note:** Refer here for instructions to download and run this sample locally on your computer

Analyzing New York City taxi data using big data tools

At 10.5, ArcGIS Enterprise introduces ArcGIS GeoAnalytics Server, which provides the ability to perform big data analysis on your own infrastructure. This sample demonstrates the steps involved in performing an aggregation analysis on New York City taxi point data using the ArcGIS API for Python.

The data used in this sample can be downloaded from the NYC Taxi & Limousine Commission website. For this sample, data for January and February of 2015 were used, each month averaging around 12 million records.

Note: The ability to perform big data analysis is only available in ArcGIS Enterprise 10.5 licensed with GeoAnalytics Server, and is not yet available on ArcGIS Online.

The NYC taxi data

To give you an overview, let us take a look at a subset of this data, with 2,000 points, published as a feature service.

In [1]:
import arcgis
from arcgis.gis import GIS

ago_gis = GIS() # Connect to ArcGIS Online as an anonymous user
search_subset = ago_gis.content.search("NYC_taxi_subset", item_type = "Feature Layer")
subset_item = search_subset[0]
subset_item
Out[1]:
NYC_taxi_subset
A subset of NYC taxi data
Feature Layer Collection by atma.mani
Last Modified: September 14, 2016
0 comments, 2 views

Let us bring up a map to display the data.

In [2]:
subset_map = ago_gis.map("New York, NY", zoomlevel=11)
subset_map
In [3]:
subset_map.add_layer(subset_item)

Let us access the feature layers and their attribute table to understand the structure of our data. In the cell below, we use the query() method to get the attribute information. The query() method returns a FeatureSet object, which can be thought of as a collection of individual Feature objects.

You can mine through the FeatureSet, get individual Features and read their attribute information to compose a table of all features and their attributes. However, the FeatureSet object provides a much easier way to get that information. Using the df property of a FeatureSet, you can read the attribute information as a pandas dataframe object.
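
For illustration, here is a minimal sketch of that manual route, iterating the Feature objects directly (the df approach in the cells below is what we will actually use):

# A minimal sketch of the manual route: iterate Feature objects and read
# their attribute dictionaries directly, instead of using the df property.
layer = subset_item.layers[0]
fset = layer.query(where='OBJECTID < 5', out_fields='*')
for feature in fset.features:
    print(feature.attributes['fare_amount'], feature.attributes['trip_distance'])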

To read the results as a DataFrame, you need the pandas Python package installed. If you get an error that pandas cannot be found, you can install it by typing the following in the terminal that is running the Jupyter notebook.

conda install pandas
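
Alternatively, if your environment uses pip rather than conda:

pip install pandas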
In [4]:
subset_feature_layer = subset_item.layers[0]

# query the attribute information for the first four records (OBJECTID < 5)
query_result = subset_feature_layer.query(where = 'OBJECTID < 5',
                                          out_fields = "*", 
                                          returnGeometry = False)

att_data_frame = query_result.df # get as a Pandas dataframe
att_data_frame
Out[4]:
| OBJECTID | Field1 | RateCodeID | VendorID | dropoff_latitude | dropoff_longitude | extra | fare_amount | improvement_surcharge | mta_tax | passenger_count | payment_type | pickup_latitude | pickup_longitude | store_and_fwd_flag | tip_amount | tolls_amount | total_amount | tpep_dropoff_datetime | tpep_pickup_datetime | trip_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3479320 | 1 | 2 | 40.782318 | -73.980492 | 0.0 | 9.5 | 0.3 | 0.5 | 5 | 1 | 40.778149 | -73.956291 | N | 2.1 | 0 | 12.4 | 1422268943000 | 1422268218000 | 1.76 |
| 2 | 8473342 | 1 | 2 | 40.769756 | -73.950600 | 0.5 | 13.5 | 0.3 | 0.5 | 1 | 2 | 40.729458 | -73.983864 | N | 0.0 | 0 | 14.8 | 1422137577000 | 1422136892000 | 3.73 |
| 3 | 10864374 | 1 | 2 | 40.753040 | -73.985680 | 0.0 | 14.5 | 0.3 | 0.5 | 1 | 2 | 40.743740 | -73.987617 | N | 0.0 | 0 | 15.3 | 1422719906000 | 1422718711000 | 2.84 |
| 4 | 7350094 | 1 | 2 | 40.765743 | -73.954994 | 0.0 | 11.5 | 0.3 | 0.5 | 1 | 2 | 40.757507 | -73.981682 | N | 0.0 | 0 | 12.3 | 1420907558000 | 1420906601000 | 2.18 |

The table above shows the attribute information available in the NYC taxi dataset. Columns such as the pickup and dropoff locations, fare, tip, toll, and trip distance provide a wealth of information, allowing many interesting patterns to be observed. Our full dataset contains over 24 million points. To discern patterns in it, let us aggregate the points into square bins 1 kilometer on a side.
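
As a quick, purely illustrative check on the four sample records above, we can derive the tip as a fraction of the fare with pandas. The tip_fraction column is a name introduced here for illustration; it is not part of the dataset.

# Illustrative only: derive tip as a fraction of fare on the 4-record sample.
# 'tip_fraction' is a column name we introduce here; it is not in the source data.
att_data_frame['tip_fraction'] = (att_data_frame['tip_amount'] /
                                  att_data_frame['fare_amount'])
att_data_frame[['fare_amount', 'tip_amount', 'tip_fraction', 'trip_distance']]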

Searching for big data file shares

To process the CSV data you have downloaded using GeoAnalytics Server, you need to register the data with your GeoAnalytics Server. In this sample, the data is in multiple CSV files, which will be registered together as a big data file share.

Let us connect to an ArcGIS Enterprise deployment.

In [5]:
gis = GIS("https://yourportal.domain.com/webcontext", "username", "password")

Ensure that GeoAnalytics is supported in our GIS.

In [6]:
arcgis.geoanalytics.is_supported()
Out[6]:
True

Get the GeoAnalytics datastores and search them for the registered datasets:

In [7]:
datastores = arcgis.geoanalytics.get_datastores()
In [8]:
bigdata_fileshares = datastores.search()
bigdata_fileshares
Out[8]:
[<Datastore title:"/bigDataFileShares/Chicago_accidents" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_1m_168yrs" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/hurricanes_all" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/Hurricane_tracks" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYCdata" type:"bigDataFileShare">,
 <Datastore title:"/bigDataFileShares/NYC_taxi" type:"bigDataFileShare">]

The NYC taxi data is registered with the GeoAnalytics datastore as the NYCdata big data file share, so we can reference it:

In [9]:
data_item = bigdata_fileshares[5]
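
If you would rather not rely on the order of the search results, a more defensive approach is to match on the datastore title instead of a positional index. A minimal sketch, assuming the Datastore objects expose the title shown in the listing above:

# Hedged alternative to positional indexing: pick the datastore whose
# title ends with 'NYCdata', matching the listing shown above.
data_item = next(ds for ds in bigdata_fileshares
                 if ds.title.endswith('NYCdata'))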

Registering big data file shares

The code below shows how a big data file share can be registered with the GeoAnalytics datastore, in case it is not already registered.

In [10]:
data_item = datastores.add_bigdata("NYCdata", r"\\teton\atma_shared\datasets\NYC_taxi")
Big Data file share exists for NYCdata

Once a big data file share is created, the GeoAnalytics Server processes all the valid file types to discern the schema of the data. This process can take a few minutes, depending on the size of your data. Once processing is complete, querying the manifest property returns the schema. As you can see below, the schema is similar to that of the subset we observed earlier in this sample.

In [11]:
data_item.manifest
Out[11]:
{'datasets': [{'format': {'encoding': 'UTF-8',
    'extension': 'csv',
    'fieldDelimiter': ',',
    'hasHeaderRow': True,
    'quoteChar': '"',
    'recordTerminator': '\n',
    'type': 'delimited'},
   'geometry': {'fields': [{'formats': ['x'], 'name': 'pickup_longitude'},
     {'formats': ['y'], 'name': 'pickup_latitude'}],
    'geometryType': 'esriGeometryPoint',
    'spatialReference': {'wkid': 4326}},
   'name': 'sampled',
   'schema': {'fields': [{'name': 'VendorID',
      'type': 'esriFieldTypeBigInteger'},
     {'name': 'tpep_pickup_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'tpep_dropoff_datetime', 'type': 'esriFieldTypeString'},
     {'name': 'passenger_count', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'trip_distance', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'pickup_latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'RateCodeID', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'store_and_fwd_flag', 'type': 'esriFieldTypeString'},
     {'name': 'dropoff_longitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'dropoff_latitude', 'type': 'esriFieldTypeDouble'},
     {'name': 'payment_type', 'type': 'esriFieldTypeBigInteger'},
     {'name': 'fare_amount', 'type': 'esriFieldTypeDouble'},
     {'name': 'extra', 'type': 'esriFieldTypeDouble'},
     {'name': 'mta_tax', 'type': 'esriFieldTypeDouble'},
     {'name': 'tip_amount', 'type': 'esriFieldTypeDouble'},
     {'name': 'tolls_amount', 'type': 'esriFieldTypeDouble'},
     {'name': 'improvement_surcharge', 'type': 'esriFieldTypeDouble'},
     {'name': 'total_amount', 'type': 'esriFieldTypeDouble'}]},
   'time': {'fields': [{'formats': ['MM/dd/yyyy HH:mm'],
      'name': 'tpep_pickup_datetime'}],
    'timeReference': {'timeZone': 'UTC'},
    'timeType': 'instant'}}]}
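
The manifest is a plain Python dictionary, so you can, for example, pull out just the field names and types. A minimal sketch based on the structure shown above:

# Walk the manifest dictionary shown above and list each field's name and type.
fields = data_item.manifest['datasets'][0]['schema']['fields']
for field in fields:
    print(field['name'], '->', field['type'])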

Performing data aggregation

When you add a big data file share datastore, a corresponding item gets created on your portal. You can search for it like a regular item and query its layers.

In [12]:
search_result = gis.content.search("", item_type = "big data file share")
search_result
Out[12]:
[<Item title:"bigDataFileShares_hurricanes_1m_168yrs" type:Big Data File Share owner:admin>,
 <Item title:"bigDataFileShares_NYC_taxi" type:Big Data File Share owner:admin>,
 <Item title:"bigDataFileShares_hurricanes" type:Big Data File Share owner:admin>,
 <Item title:"bigDataFileShares_Chicago_accidents" type:Big Data File Share owner:admin>,
 <Item title:"bigDataFileShares_hurricanes_all" type:Big Data File Share owner:admin>,
 <Item title:"bigDataFileShares_NYCdata" type:Big Data File Share owner:admin>]
In [13]:
data_item = search_result[5]
data_item
Out[13]:
bigDataFileShares_NYCdata
Big Data File Share by admin
Last Modified: December 01, 2016
0 comments, 0 views
In [14]:
data_item.layers
Out[14]:
[<Layer url:"https://yourserver.domain.com/webcontext/rest/services/DataStoreCatalogs/bigDataFileShares_NYCdata/BigDataCatalogServer/sampled">]
In [15]:
year_2015 = data_item.layers[0]
year_2015
Out[15]:
<Layer url:"https://yourserver.domain.com/webcontext/rest/services/DataStoreCatalogs/bigDataFileShares_NYCdata/BigDataCatalogServer/sampled">

Aggregate points tool

The aggregate_points() tool, imported below from the arcgis.geoanalytics.summarize_data module, can be accessed through the tools.bigdata property of your GIS. In this example, we use this tool to aggregate the numerous points into square bins 1 kilometer on a side. The tool creates a feature layer as its output, which can be accessed once the processing is complete.

In [16]:
from arcgis.geoanalytics.summarize_data import aggregate_points
In [17]:
# use a projected coordinate system (Web Mercator, wkid 3857) so the
# 1-kilometer bins can be computed in linear units
arcgis.env.process_spatial_reference = 3857
In [18]:
agg_result = aggregate_points(year_2015, bin_size=1, bin_size_unit='Kilometers')
Submitted.
Executing...
Executing (AggregatePoints): AggregatePoints "Feature Set" # 1 Kilometers # # # # # # # {"serviceProperties":{"name":"Aggregate_Points_Analysis_7PE5QR","serviceUrl":"http://yourserver.domain.com/webcontext/rest/services/Hosted/Aggregate_Points_Analysis_7PE5QR/FeatureServer"},"itemProperties":{"itemId":"42add58c05d54ae1a492175894410b02"}} {"processSR":{"wkid":3857}}
Start Time: Wed Dec 14 12:38:07 2016
Using URL based GPRecordSet param: https://yourserver.domain.com/webcontext/rest/services/DataStoreCatalogs/bigDataFileShares_NYCdata/BigDataCatalogServer/sampled?token=8wIONMIEQR-QQPpTaXJ_Tz-PRPWtYX9AQapkiMAa0GbMYQORQuzNMAWAjN641uvgg5yNJnXinc_NG3OUG0d0u2vySZ-r9BePKOMBuA_hEy0boVvNZ5_bGGeMl0v-4agYjkZKfOJxm4zZWEQoXJqx67WHktNU_s2snZ0QGE-wKLE.
{"messageCode":"BD_101033","message":"'pointLayer' will be projected into the processing spatial reference.","params":{"paramName":"pointLayer"}}
{"messageCode":"BD_101028","message":"Starting new distributed job with 6 tasks.","params":{"totalTasks":"6"}}
{"messageCode":"BD_101029","message":"1/6 distributed tasks completed.","params":{"completedTasks":"1","totalTasks":"6"}}
{"messageCode":"BD_101029","message":"6/6 distributed tasks completed.","params":{"completedTasks":"6","totalTasks":"6"}}
  interval = None
  count = 152
{"messageCode":"BD_0","message":"Feature service layer created: http://yourserver.domain.com/webcontext/rest/services/Hosted/Aggregate_Points_Analysis_7PE5QR/FeatureServer/0","params":{"serviceUrl":"http://yourserver.domain.com/webcontext/rest/services/Hosted/Aggregate_Points_Analysis_7PE5QR/FeatureServer/0"}}
Succeeded at Wed Dec 14 12:38:15 2016 (Elapsed Time: 8.00 seconds)

Inspect the results

Let us create a map and load the processed result, which is a feature layer item.

In [19]:
processed_map = gis.map('New York, NY', 11)
processed_map
In [20]:
processed_map.add_layer(agg_result)

Let us inspect the analysis result using smart mapping. To learn more about this visualization capability, refer to the guide on Smart Mapping under the ‘Mapping and Visualization’ section.

In [21]:
map2 = gis.map("New York, NY", 11)
map2
In [22]:
map2.add_layer(agg_result, {
                "renderer":"ClassedColorRenderer",
                "field_name":"MAX_tip_amount", 
                "normalizationField":'MAX_trip_distance',
                "classificationMethod":'natural-breaks',
                "opacity":0.75
              })

We can now start seeing patterns, such as which pickup areas resulted in higher tips for the cab drivers.
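
To go beyond visual inspection, you can also query the aggregated bins directly. A minimal sketch, assuming the result item exposes its layers like the other feature layer items in this sample, and reusing the df property and the field names from the renderer above:

# Hedged sketch: read the aggregated bins into pandas and rank them by the
# maximum tip recorded in each bin. Field names match those used in the
# renderer above; the query arguments mirror the earlier cells in this sample.
agg_layer = agg_result.layers[0]
bins_frame = agg_layer.query(out_fields='MAX_tip_amount, MAX_trip_distance',
                             returnGeometry=False).df
bins_frame.sort_values('MAX_tip_amount', ascending=False).head()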