Xarray - an introduction
Purpose: Xarray was created to make it easy to work with multidimensional arrays (or tensors). These n-D arrays are common in data science, machine learning, and climate science. Although it is possible to work with n-D arrays entirely in NumPy, doing so sacrifices transparency, code readability, and the ability to easily apply an operation to a "dataset" of choice.
Core data structures
Xarray has 2 core data structures that extend the core strengths of NumPy and Pandas.
DataArray - a labeled n-dimensional array. It is an n-D generalization of pandas.Series.
Dataset - a dict-like container of DataArray objects aligned along any number of shared dimensions. It is similar to how pandas.DataFrame builds on pandas.Series.
The Dataset object allows the user to query, extract, or combine DataArrays over a particular dimension across all variables. This pattern quickly becomes convenient when dealing with spatio-temporal datasets.
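As a minimal sketch of what "across all variables" means, the toy Dataset below (hypothetical names and values, not the tutorial data) holds two variables on the same dimensions. A single label-based selection is applied to both.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: two variables sharing the same (time, lat) dimensions
times = pd.date_range("2020-01-01", periods=4, freq="D")
lats = [10.0, 20.0, 30.0]
toy = xr.Dataset(
    {
        "temperature": (("time", "lat"), np.arange(12.0).reshape(4, 3)),
        "pressure": (("time", "lat"), np.arange(12.0).reshape(4, 3) * 10),
    },
    coords={"time": times, "lat": lats},
)

# One selection by label is applied to every variable in the Dataset
subset = toy.sel(lat=20.0)
print(subset["temperature"].values)  # [ 1.  4.  7. 10.]
print(subset["pressure"].values)     # [ 10.  40.  70. 100.]
```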
import numpy as np
# importing as xr is by convention
import xarray as xr
import pandas as pd
Dataset object
ds = xr.tutorial.load_dataset("air_temperature")
ds
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
This dataset contains air temperature at 2920 time steps on a 25 x 53 lat-lon grid. lon, lat, and time are coordinates (each one-dimensional here) and air is a data variable.
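The split between coordinates and data variables can be inspected programmatically. The sketch below uses a small stand-in Dataset (hypothetical values, so it runs without downloading the tutorial file) and reads its sizes, data variables, and coordinates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in with the same layout as the tutorial dataset, but tiny
toy = xr.Dataset(
    {"air": (("time", "lat", "lon"), np.zeros((2, 3, 4), dtype="float32"))},
    coords={
        "time": pd.date_range("2013-01-01", periods=2, freq="D"),
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5, 205.0, 207.5],
    },
)

print(dict(toy.sizes))      # length of each dimension
print(list(toy.data_vars))  # ['air']
print(sorted(toy.coords))   # ['lat', 'lon', 'time']
```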
DataArray object
da = ds.air  # attribute access; dict-style ds['air'] works too
da
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2, 242.5, 243.5, ..., 232.79999, 235.5, 238.59999],
        [243.79999, 244.5, 244.7, ..., 232.79999, 235.29999, 239.29999],
        [250., 249.79999, 248.89, ..., 233.2, 236.39, 241.7],
        ...,
        [296.6, 296.19998, 296.4, ..., 295.4, 295.1, 294.69998],
        [295.9, 296.19998, 296.79, ..., 295.9, 295.9, 295.19998],
        [296.29, 296.79, 297.1, ..., 296.9, 296.79, 296.6]],

       [[242.09999, 242.7, 243.09999, ..., 232., 233.59999, 235.79999],
        [243.59999, 244.09999, 244.2, ..., 231., 232.5, 235.7],
        [253.2, 252.89, 252.09999, ..., 230.79999, 233.39, 238.5],
...
        [293.69, 293.88998, 295.38998, ..., 295.09, 294.69, 294.29],
        [296.29, 297.19, 297.59, ..., 295.29, 295.09, 294.38998],
        [297.79, 298.38998, 298.49, ..., 295.69, 295.49, 295.19]],

       [[245.09, 244.29, 243.29, ..., 241.68999, 241.48999, 241.79],
        [249.89, 249.29, 248.39, ..., 239.59, 240.29, 241.68999],
        [262.99, 262.19, 261.38998, ..., 239.89, 242.59, 246.29],
        ...,
        [293.79, 293.69, 295.09, ..., 295.29, 295.09, 294.69],
        [296.09, 296.88998, 297.19, ..., 295.69, 295.69, 295.19],
        [297.69, 298.09, 298.09, ..., 296.49, 296.19, 295.69]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
To extract just the underlying array, without labels, use the .data attribute:
air_temp = da.data
print(type(air_temp))
print(air_temp.shape)
<class 'numpy.ndarray'>
(2920, 25, 53)
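A related accessor is .values. As a small sketch on a hypothetical toy array: .data returns whatever array object backs the DataArray (a NumPy array here, but it could be a lazy array such as Dask when data is loaded in chunks), while .values always converts to a NumPy array.

```python
import numpy as np
import xarray as xr

# Hypothetical toy array backed by NumPy
toy_da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))

raw = toy_da.data     # the backing array object, returned as-is
vals = toy_da.values  # always a NumPy array

print(type(raw).__name__, type(vals).__name__)  # ndarray ndarray
print(raw.shape)                                # (2, 3)
```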
Dimensions, coordinates, attributes
A DataArray may have dimensions that also carry coordinates. It may also have dimensions without coordinates.
da.dims
('time', 'lat', 'lon')
da.coords
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
da.attrs
{'long_name': '4xDaily Air temperature at sigma level 995', 'units': 'degK', 'precision': 2, 'GRIB_id': 11, 'GRIB_name': 'TMP', 'var_desc': 'Air temperature', 'dataset': 'NMC Reanalysis', 'level_desc': 'Surface', 'statistic': 'Individual Obs', 'parent_stat': 'Other', 'actual_range': array([185.16, 322.1 ], dtype=float32)}
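To illustrate the difference between a dimension with a coordinate and one without, here is a minimal sketch on a hypothetical toy array. Only "x" is given labels, so label-based selection (.sel) works on "x", while "y" can still be indexed by position (.isel).

```python
import numpy as np
import xarray as xr

# Hypothetical toy array: "x" has coordinate labels, "y" does not
toy_da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    attrs={"units": "K"},
)

print(toy_da.dims)             # ('x', 'y')
print("x" in toy_da.coords)    # True
print("y" in toy_da.coords)    # False
print(toy_da.sel(x=20).shape)  # (3,)  label-based selection on "x"
print(toy_da.isel(y=0).shape)  # (2,)  positional selection on "y"
```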
Interop with Pandas
# convert a DataArray to a Pandas Series
air_temp_pd = da.to_series()
air_temp_pd
time                 lat   lon
2013-01-01 00:00:00  75.0  200.0    241.199997
                           202.5    242.500000
                           205.0    243.500000
                           207.5    244.000000
                           210.0    244.099991
                                       ...
2014-12-31 18:00:00  15.0  320.0    297.389984
                           322.5    297.190002
                           325.0    296.489990
                           327.5    296.190002
                           330.0    295.690002
Name: air, Length: 3869000, dtype: float32
type(air_temp_pd)
pandas.core.series.Series
When converted to a Pandas Series, the air temperature data has a 3-level index (time, lat, lon).
da.to_dataframe()
| time | lat | lon | air |
|---|---|---|---|
| 2013-01-01 00:00:00 | 75.0 | 200.0 | 241.199997 |
| | | 202.5 | 242.500000 |
| | | 205.0 | 243.500000 |
| | | 207.5 | 244.000000 |
| | | 210.0 | 244.099991 |
| ... | ... | ... | ... |
| 2014-12-31 18:00:00 | 15.0 | 320.0 | 297.389984 |
| | | 322.5 | 297.190002 |
| | | 325.0 | 296.489990 |
| | | 327.5 | 296.190002 |
| | | 330.0 | 295.690002 |
3869000 rows × 1 columns
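The conversion also works in the other direction. As a sketch on a hypothetical toy array, a Series with a MultiIndex can be turned back into a DataArray with xr.DataArray.from_series (pandas objects also offer a to_xarray() method), with each index level becoming a dimension.

```python
import numpy as np
import xarray as xr

# Hypothetical toy array with labeled lat/lon dimensions
toy_da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0], "lon": [0.0, 90.0, 180.0]},
    name="air",
)

series = toy_da.to_series()              # Series indexed by (lat, lon)
back = xr.DataArray.from_series(series)  # each index level becomes a dimension

print(list(series.index.names))  # ['lat', 'lon']
print(back.dims)                 # ('lat', 'lon')
print(back.equals(toy_da))       # True
```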
Composing a DataArray and Dataset
Suppose you have only the raw data. How do you compose a DataArray and a Dataset from it?
raw_data = da.data
print(type(raw_data))
print(raw_data.shape)
<class 'numpy.ndarray'>
(2920, 25, 53)
raw_data[0,0,1]
242.5
# For now, let us not expand each array
xr.set_options(display_expand_data=False)
<xarray.core.options.set_options at 0x7f9eb1e21d00>
# use DataArray constructor
da2 = xr.DataArray(raw_data, dims=('time','lat','lon'))
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: time, lat, lon
The coordinates are empty although the data has 3 dimensions. You can set the coordinates using another DataArray object or a NumPy array. In this example, lat and lon are evenly spaced.
lon_array = np.arange(start=200, stop=331, step=2.5)
print(lon_array.shape)
(53,)
da2.coords['lon'] = lon_array
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time, lat
Similarly, set the latitude coordinate (the time coordinate is set later, on the Dataset):
da2.coords['lat'] = np.arange(start=75, stop=14.9, step=-2.5)
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
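The np.arange calls above rely on a stop value chosen slightly past the last grid point (331 and 14.9). One possible alternative, not used in the original, is np.linspace, which takes the endpoints and the number of points directly. The NumPy documentation suggests it is often the better choice for non-integer steps. A minimal sketch:

```python
import numpy as np

# Same grids as above, built from endpoints and point counts
lon_linspace = np.linspace(200, 330, num=53)  # 200.0, 202.5, ..., 330.0
lat_linspace = np.linspace(75, 15, num=25)    # 75.0, 72.5, ..., 15.0

# Compare with the arange-based versions used in this tutorial
lon_arange = np.arange(start=200, stop=331, step=2.5)
lat_arange = np.arange(start=75, stop=14.9, step=-2.5)

print(lon_linspace.shape, lat_linspace.shape)  # (53,) (25,)
print(np.allclose(lon_linspace, lon_arange))   # True
print(np.allclose(lat_linspace, lat_arange))   # True
```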
You can also assign attributes in a similar fashion:
da2.attrs['some_attribute'] = 'hello'
da2
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes:
    some_attribute:  hello
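Coordinates and attributes do not have to be attached one at a time. The DataArray constructor also accepts coords, attrs, and name arguments, so everything can be supplied in one call. The sketch below uses small hypothetical data (random values, shortened axes) rather than the tutorial array.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical small array standing in for the raw data
data = np.random.default_rng(0).random((4, 3, 2)).astype("float32")

da3 = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2013-01-01", periods=4, freq="D"),
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5],
    },
    attrs={"some_attribute": "hello"},
    name="air",
)

print(sorted(da3.coords))          # ['lat', 'lon', 'time']
print(da3.attrs)                   # {'some_attribute': 'hello'}
print(da3.sel(lat=72.5).shape)     # (4, 2)
```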
Composing a Dataset
ds2 = xr.Dataset({'air': da2, 'air2': da2})  # pass a dict-like mapping with any number of variables
ds2
<xarray.Dataset>
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
# 6-hourly timestamps; pandas 2.2 and later prefer the lowercase alias "6h"
ds2.coords['time'] = pd.date_range(start='2013-01-01', end='2014-12-31 18:00', freq='6H')
ds2
<xarray.Dataset>
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
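A Dataset can also be built directly from raw arrays, without creating DataArray objects first. In the sketch below (hypothetical small arrays and a made-up title attribute), each variable is given as a (dims, data) tuple and the coordinates are passed in the same call.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical small arrays standing in for the raw data
shape = (4, 3, 2)
dims = ("time", "lat", "lon")

ds3 = xr.Dataset(
    data_vars={
        "air": (dims, np.zeros(shape, dtype="float32")),
        "air2": (dims, np.ones(shape, dtype="float32")),
    },
    coords={
        "time": pd.date_range("2013-01-01", periods=4, freq="D"),
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5],
    },
    attrs={"title": "toy example"},  # made-up attribute for illustration
)

# Label-based slices include both endpoints and apply to all variables
window = ds3.sel(time=slice("2013-01-02", "2013-01-03"))
print(sorted(ds3.data_vars))  # ['air', 'air2']
print(window.sizes["time"])   # 2
```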