Xarray - an introduction¶

Purpose: Xarray was created to make it easy to work with multidimensional arrays (or tensors). These n-D arrays are common in data science, machine learning and climate science. Although it is possible to work with n-D arrays entirely in NumPy, plain NumPy arrays carry no labels, so you lose transparency, code readability and the ability to easily apply an operation to a named subset of the data.
Core data structures¶
Xarray has two core data structures that extend the core strengths of NumPy and Pandas.

DataArray - a labeled n-dimensional array. It is an n-D generalization of pandas.Series.
Dataset - a dict-like container of DataArray objects aligned along any number of shared dimensions. It is similar to how pandas.DataFrame builds on pandas.Series.
The Dataset object allows the user to query, extract or combine DataArrays over a particular dimension across all variables. This pattern quickly becomes convenient when dealing with spatio-temporal datasets.
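As a rough illustration of that pattern, the sketch below builds a small toy Dataset and applies one selection and one reduction to all of its variables at once. The variable names, grid and values are made up for the example and are not part of the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 time steps on a 3 x 2 lat/lon grid.
times = pd.date_range("2013-01-01", periods=4, freq="D")
lats = [10.0, 20.0, 30.0]
lons = [100.0, 110.0]
shape = (len(times), len(lats), len(lons))

toy = xr.Dataset(
    data_vars={
        "air": (("time", "lat", "lon"), np.full(shape, 280.0)),
        "humidity": (("time", "lat", "lon"), np.full(shape, 0.5)),
    },
    coords={"time": times, "lat": lats, "lon": lons},
)

# One label-based selection slices every variable at once.
at_lat20 = toy.sel(lat=20.0)

# A reduction over a named dimension is likewise applied to every variable.
time_mean = toy.mean(dim="time")

print(at_lat20["air"].dims, at_lat20["humidity"].dims)
print(time_mean["air"].shape)
```

Both variables lose the lat dimension after the selection, and both lose the time dimension after the reduction, without any per-variable bookkeeping.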
import numpy as np
# importing as xr is by convention
import xarray as xr
import pandas as pd
Dataset object¶
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions: (lat: 25, time: 2920, lon: 53)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
This dataset holds air temperature at 2920 time steps on a 25 x 53 lat/lon grid. lat, lon and time are 1-D coordinates and air is the data variable.
DataArray object¶
da = ds.air  # attribute access; the dict-style ds['air'] is equivalent
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2 , 242.5 , 243.5 , ..., 232.79999, 235.5 ,
238.59999],
[243.79999, 244.5 , 244.7 , ..., 232.79999, 235.29999,
239.29999],
[250. , 249.79999, 248.89 , ..., 233.2 , 236.39 ,
241.7 ],
...,
[296.6 , 296.19998, 296.4 , ..., 295.4 , 295.1 ,
294.69998],
[295.9 , 296.19998, 296.79 , ..., 295.9 , 295.9 ,
295.19998],
[296.29 , 296.79 , 297.1 , ..., 296.9 , 296.79 ,
296.6 ]],
[[242.09999, 242.7 , 243.09999, ..., 232. , 233.59999,
235.79999],
[243.59999, 244.09999, 244.2 , ..., 231. , 232.5 ,
235.7 ],
[253.2 , 252.89 , 252.09999, ..., 230.79999, 233.39 ,
238.5 ],
...
[293.69 , 293.88998, 295.38998, ..., 295.09 , 294.69 ,
294.29 ],
[296.29 , 297.19 , 297.59 , ..., 295.29 , 295.09 ,
294.38998],
[297.79 , 298.38998, 298.49 , ..., 295.69 , 295.49 ,
295.19 ]],
[[245.09 , 244.29 , 243.29 , ..., 241.68999, 241.48999,
241.79 ],
[249.89 , 249.29 , 248.39 , ..., 239.59 , 240.29 ,
241.68999],
[262.99 , 262.19 , 261.38998, ..., 239.89 , 242.59 ,
246.29 ],
...,
[293.79 , 293.69 , 295.09 , ..., 295.29 , 295.09 ,
294.69 ],
[296.09 , 296.88998, 297.19 , ..., 295.69 , 295.69 ,
295.19 ],
[297.69 , 298.09 , 298.09 , ..., 296.49 , 296.19 ,
295.69 ]]], dtype=float32)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
long_name: 4xDaily Air temperature at sigma level 995
units: degK
precision: 2
GRIB_id: 11
GRIB_name: TMP
var_desc: Air temperature
dataset: NMC Reanalysis
level_desc: Surface
statistic: Individual Obs
parent_stat: Other
actual_range: [185.16 322.1 ]
To extract just the data, use
air_temp = da.data
print(type(air_temp))
print(air_temp.shape)
<class 'numpy.ndarray'>
(2920, 25, 53)
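A small sketch of the two common ways to get at the raw values, using a made-up array in place of the air temperature data. The .data attribute returns whatever array backs the DataArray (a NumPy array here), while .values always converts to a NumPy array.

```python
import numpy as np
import xarray as xr

# Hypothetical small array standing in for the air temperature data.
small = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("time", "lat"))

backing = small.data      # the array that backs the DataArray
as_numpy = small.values   # always a numpy.ndarray

print(type(backing).__name__, backing.shape)
print(np.array_equal(backing, as_numpy))
```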
Dimensions, coordinates, attributes¶
A DataArray may have dimensions that also carry coordinates, that is, labels for each position along the dimension. It may also have dimensions without coordinates.
da.dims
('time', 'lat', 'lon')
da.coords
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* lon (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
da.attrs
{'long_name': '4xDaily Air temperature at sigma level 995',
'units': 'degK',
'precision': 2,
'GRIB_id': 11,
'GRIB_name': 'TMP',
'var_desc': 'Air temperature',
'dataset': 'NMC Reanalysis',
'level_desc': 'Surface',
'statistic': 'Individual Obs',
'parent_stat': 'Other',
'actual_range': array([185.16, 322.1 ], dtype=float32)}
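The difference between a dimension with a coordinate and one without can be seen on a tiny example. This is only a sketch: the array, the dimension names x and y and the labels are invented for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 array: "x" gets coordinate labels, "y" does not.
arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20]},
)

print(arr.dims)           # both dimensions are listed
print(list(arr.coords))   # only "x" has a coordinate

# Label-based selection looks up the coordinate values of "x".
row = arr.sel(x=20)

# Positional selection with isel works on any dimension, labeled or not.
col = arr.isel(y=0)
```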
Interop with Pandas¶
# to and from Pandas
air_temp_pd = da.to_series()
air_temp_pd
time lat lon
2013-01-01 00:00:00 75.0 200.0 241.199997
202.5 242.500000
205.0 243.500000
207.5 244.000000
210.0 244.099991
...
2014-12-31 18:00:00 15.0 320.0 297.389984
322.5 297.190002
325.0 296.489990
327.5 296.190002
330.0 295.690002
Name: air, Length: 3869000, dtype: float32
type(air_temp_pd)
pandas.core.series.Series
When converted to a Pandas Series, the air temperature is indexed by a three-level MultiIndex (time, lat, lon), one level per dimension.
da.to_dataframe()
| time | lat | lon | air |
|---|---|---|---|
| 2013-01-01 00:00:00 | 75.0 | 200.0 | 241.199997 |
| | | 202.5 | 242.500000 |
| | | 205.0 | 243.500000 |
| | | 207.5 | 244.000000 |
| | | 210.0 | 244.099991 |
| ... | ... | ... | ... |
| 2014-12-31 18:00:00 | 15.0 | 320.0 | 297.389984 |
| | | 322.5 | 297.190002 |
| | | 325.0 | 296.489990 |
| | | 327.5 | 296.190002 |
| | | 330.0 | 295.690002 |
3869000 rows × 1 columns
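The conversion also works in the other direction. The sketch below round-trips a made-up 2 x 3 array through Pandas, using xr.DataArray.from_series for a Series and the to_xarray method that Pandas provides on a DataFrame. The data and labels are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 x 3 array with labeled dimensions.
small = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0], "lon": [100.0, 110.0, 120.0]},
    name="air",
)

# DataArray -> Series: each dimension becomes one level of a MultiIndex.
series = small.to_series()
print(list(series.index.names))

# Series -> DataArray: the MultiIndex levels become dimensions again.
back = xr.DataArray.from_series(series)

# DataFrame -> Dataset: each column becomes a data variable.
ds_back = small.to_dataframe().to_xarray()
print(list(ds_back.data_vars))
```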
Composing a DataArray and Dataset¶
Say you have the raw data as a NumPy array. How do you compose a DataArray and a Dataset from it?
raw_data = da.data
print(type(raw_data))
print(raw_data.shape)
<class 'numpy.ndarray'>
(2920, 25, 53)
raw_data[0,0,1]
242.5
# For now, let us not expand each array
xr.set_options(display_expand_data=False)
<xarray.core.options.set_options at 0x7f9eb1e21d00>
# use DataArray constructor
da2 = xr.DataArray(raw_data, dims=('time','lat','lon'))
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: time, lat, lon
The coordinates are empty although the data has 3 dimensions. You can set the coordinates using another DataArray object or a NumPy array. In this example, lat and lon are evenly spaced.
lon_array = np.arange(start=200, stop=331, step=2.5)
print(lon_array.shape)
(53,)
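With a non-integer step, whether np.arange includes the last value depends on floating-point rounding, which is why the stop value above is padded past 330. One alternative, shown here as a sketch rather than as the approach this tutorial uses, is np.linspace, which takes both endpoints and the number of points. The lat grid below assumes the 75 to 15 range and 25 points shown in the dataset summary earlier.

```python
import numpy as np

# Longitude grid built from its two endpoints and the number of points.
lon_linspace = np.linspace(200.0, 330.0, num=53)

# Latitude grid, descending from 75 to 15 in 25 points.
lat_linspace = np.linspace(75.0, 15.0, num=25)

# Compare against the arange construction used in the text.
lon_arange = np.arange(start=200, stop=331, step=2.5)
print(np.allclose(lon_linspace, lon_arange))
```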
da2.coords['lon'] = lon_array
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time, lat
Similarly, set the latitude coordinate. The time coordinate is set further below, on the Dataset.
da2.coords['lat'] = np.arange(start=75, stop=14.9, step=-2.5)
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* lat (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
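The coordinates assigned one at a time above can also be passed to the DataArray constructor in a single call through its coords argument. The sketch below uses a zero-filled array with a shortened time axis as a stand-in for raw_data. The name da3 and the data are assumptions made for illustration.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for raw_data, with a much shorter time axis.
values = np.zeros((4, 25, 53), dtype="float32")

da3 = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "lat": np.linspace(75.0, 15.0, num=25),
        "lon": np.linspace(200.0, 330.0, num=53),
    },
    name="air",
)

print(da3.dims)
print(list(da3.coords))  # "time" is still a dimension without a coordinate
```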
You can also assign attributes in a similar fashion
da2.attrs['some_attribute'] = 'hello'
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
* lon (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* lat (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes:
some_attribute: hello
Composing a Dataset¶
ds2 = xr.Dataset({'air': da2, 'air2': da2})  # pass a dict-like mapping of names to DataArrays; any number of variables
ds2
<xarray.Dataset>
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* lat (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Data variables:
air (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
air2 (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
ds2.coords['time'] = pd.date_range(start='2013-01-01', end="2014-12-31 18:00", freq="6H")
ds2
<xarray.Dataset>
Dimensions: (lon: 53, lat: 25, time: 2920)
Coordinates:
* lon (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
* lat (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
* time (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
air (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
air2 (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
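A Dataset can also be composed directly from raw arrays, without building each DataArray first, by passing (dims, data) tuples as data variables together with a coords mapping. The following is a minimal sketch on a tiny made-up grid. The name ds3 and all values are assumptions, and the 6-hour step is given as a pd.Timedelta rather than a frequency string.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in data: one day at 6-hour steps on a tiny 2 x 3 grid.
times = pd.date_range("2013-01-01", periods=4, freq=pd.Timedelta(hours=6))
values = np.arange(4 * 2 * 3, dtype="float32").reshape(4, 2, 3)

ds3 = xr.Dataset(
    data_vars={
        "air": (("time", "lat", "lon"), values),
        "air2": (("time", "lat", "lon"), values + 1),
    },
    coords={"time": times, "lat": [75.0, 72.5], "lon": [200.0, 202.5, 205.0]},
)

# With a time coordinate in place, one label-based selection
# picks the same time step out of every variable.
noon = ds3.sel(time=pd.Timestamp("2013-01-01 12:00"))
print(noon["air"].shape, noon["air2"].shape)
```

Selecting on the Dataset rather than on each DataArray keeps the variables aligned, which is the main convenience the Dataset adds over a plain dict of arrays.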