Xarray - an introduction

Purpose

Xarray was created to make it easy to work with labeled multidimensional arrays (tensors). These n-D arrays are common in data science, machine learning, and climate science. Although it is possible to work with n-D arrays entirely in NumPy, plain NumPy arrays carry no dimension names or coordinate labels, so code that indexes by axis position is harder to read, and it is harder to apply an operation to a chosen subset of a dataset.

Core data structures

Xarray has two core data structures that extend the core strengths of NumPy and pandas.

  • DataArray - a labeled n-dimensional array. It is an n-D generalization of pandas.Series.
  • Dataset - a dict-like container of DataArray objects aligned along any number of shared dimensions. It builds on DataArray much as pandas.DataFrame builds on pandas.Series.

The Dataset object allows the user to query, extract or combine DataArrays over a particular dimension across all variables. This pattern quickly becomes convenient when dealing with spatio-temporal datasets.
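
To make the idea of querying across all variables concrete, here is a minimal sketch on a toy Dataset. The variable names (temperature, pressure), the grid, and the values are made up for illustration and do not come from the tutorial dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 daily time steps on a 3 x 2 lat/lon grid.
shape = (4, 3, 2)
demo_ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat", "lon"), np.arange(24.0).reshape(shape)),
        "pressure": (("time", "lat", "lon"), np.ones(shape)),
    },
    coords={
        "time": pd.date_range("2013-01-01", periods=4, freq="D"),
        "lat": [10.0, 20.0, 30.0],
        "lon": [100.0, 110.0],
    },
)

# A single label-based selection is applied to every variable in the Dataset.
at_lat_20 = demo_ds.sel(lat=20.0)
print(at_lat_20["temperature"].dims)
print(at_lat_20["pressure"].dims)

# Reductions work the same way: average both variables over time in one call.
time_mean = demo_ds.mean(dim="time")
print(time_mean["temperature"].shape)
```

Both variables lose the lat dimension after the selection and the time dimension after the mean, without any per-variable loop.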

In [34]:
import numpy as np

# importing as xr is by convention
import xarray as xr
import pandas as pd

Dataset object

In [38]:
ds = xr.tutorial.load_dataset("air_temperature")
ds
Out[38]:
<xarray.Dataset>
Dimensions:  (lat: 25, time: 2920, lon: 53)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

This dataset holds air temperature on a 25 x 53 lat/lon grid at 2920 time steps (four per day over 2013 and 2014). lat, lon and time are coordinates, each one-dimensional, and air is the single data variable, with dimensions (time, lat, lon).
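
The same facts can be read programmatically rather than from the printed repr. The sketch below uses a hypothetical miniature Dataset (placeholder zeros on a 4 x 3 x 2 grid) so that it runs without downloading the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical miniature of the dataset above: one variable on a 4 x 3 x 2 grid.
mini_ds = xr.Dataset(
    data_vars={"air": (("time", "lat", "lon"), np.zeros((4, 3, 2), dtype="float32"))},
    coords={
        "time": pd.date_range("2013-01-01", periods=4, freq="D"),
        "lat": [75.0, 72.5, 70.0],
        "lon": [200.0, 202.5],
    },
)

print(dict(mini_ds.sizes))      # length of each dimension
print(list(mini_ds.data_vars))  # names of the data variables
print(list(mini_ds.coords))     # names of the coordinates
print(mini_ds["lat"].ndim)      # each coordinate here is one-dimensional
```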

DataArray object

In [7]:
da = ds.air  # attribute access; ds['air'] (dict-style) is equivalent
da
Out[7]:
<xarray.DataArray 'air' (time: 2920, lat: 25, lon: 53)>
array([[[241.2    , 242.5    , 243.5    , ..., 232.79999, 235.5    ,
         238.59999],
        [243.79999, 244.5    , 244.7    , ..., 232.79999, 235.29999,
         239.29999],
        [250.     , 249.79999, 248.89   , ..., 233.2    , 236.39   ,
         241.7    ],
        ...,
        [296.6    , 296.19998, 296.4    , ..., 295.4    , 295.1    ,
         294.69998],
        [295.9    , 296.19998, 296.79   , ..., 295.9    , 295.9    ,
         295.19998],
        [296.29   , 296.79   , 297.1    , ..., 296.9    , 296.79   ,
         296.6    ]],

       [[242.09999, 242.7    , 243.09999, ..., 232.     , 233.59999,
         235.79999],
        [243.59999, 244.09999, 244.2    , ..., 231.     , 232.5    ,
         235.7    ],
        [253.2    , 252.89   , 252.09999, ..., 230.79999, 233.39   ,
         238.5    ],
...
        [293.69   , 293.88998, 295.38998, ..., 295.09   , 294.69   ,
         294.29   ],
        [296.29   , 297.19   , 297.59   , ..., 295.29   , 295.09   ,
         294.38998],
        [297.79   , 298.38998, 298.49   , ..., 295.69   , 295.49   ,
         295.19   ]],

       [[245.09   , 244.29   , 243.29   , ..., 241.68999, 241.48999,
         241.79   ],
        [249.89   , 249.29   , 248.39   , ..., 239.59   , 240.29   ,
         241.68999],
        [262.99   , 262.19   , 261.38998, ..., 239.89   , 242.59   ,
         246.29   ],
        ...,
        [293.79   , 293.69   , 295.09   , ..., 295.29   , 295.09   ,
         294.69   ],
        [296.09   , 296.88998, 297.19   , ..., 295.69   , 295.69   ,
         295.19   ],
        [297.69   , 298.09   , 298.09   , ..., 296.49   , 296.19   ,
         295.69   ]]], dtype=float32)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

To extract just the underlying array, use the .data attribute:

In [10]:
air_temp = da.data
print(type(air_temp))
print(air_temp.shape)
<class 'numpy.ndarray'>
(2920, 25, 53)
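
A DataArray also exposes .values alongside .data. The sketch below, on a small made-up array, shows that for in-memory data both give a numpy.ndarray. They differ only when the data is backed by another array type (a Dask array, for example): .data returns that object as stored, while .values converts it to NumPy.

```python
import numpy as np
import xarray as xr

# Small made-up array standing in for the air temperature data.
demo_da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("x", "y"))

underlying = demo_da.data    # the wrapped array object, returned as stored
as_numpy = demo_da.values    # the same data, always as a numpy.ndarray

print(type(underlying).__name__, type(as_numpy).__name__)
print(np.array_equal(underlying, as_numpy))
```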

Dimensions, coordinates, attributes

A DataArray's dimensions may have associated coordinates (labels along that dimension), but a dimension can also exist without coordinates.

In [11]:
da.dims
Out[11]:
('time', 'lat', 'lon')
In [12]:
da.coords
Out[12]:
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * lon      (lon) float32 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
In [13]:
da.attrs
Out[13]:
{'long_name': '4xDaily Air temperature at sigma level 995',
 'units': 'degK',
 'precision': 2,
 'GRIB_id': 11,
 'GRIB_name': 'TMP',
 'var_desc': 'Air temperature',
 'dataset': 'NMC Reanalysis',
 'level_desc': 'Surface',
 'statistic': 'Individual Obs',
 'parent_stat': 'Other',
 'actual_range': array([185.16, 322.1 ], dtype=float32)}
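
In the tutorial data every dimension has a coordinate. As a minimal sketch of the other case, the made-up array below labels only one of its two dimensions (the names x and y are illustrative):

```python
import numpy as np
import xarray as xr

demo_da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},  # only "x" gets coordinate labels
)

print(demo_da.dims)          # both dimensions are listed
print(list(demo_da.coords))  # only the labeled dimension appears
```

Here y is a dimension without a coordinate, so it can only be indexed by integer position.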

Interop with Pandas

In [15]:
# convert to a pandas Series
air_temp_pd = da.to_series()
air_temp_pd
Out[15]:
time                 lat   lon  
2013-01-01 00:00:00  75.0  200.0    241.199997
                           202.5    242.500000
                           205.0    243.500000
                           207.5    244.000000
                           210.0    244.099991
                                       ...    
2014-12-31 18:00:00  15.0  320.0    297.389984
                           322.5    297.190002
                           325.0    296.489990
                           327.5    296.190002
                           330.0    295.690002
Name: air, Length: 3869000, dtype: float32
In [16]:
type(air_temp_pd)
Out[16]:
pandas.core.series.Series

The resulting pandas Series has a three-level MultiIndex (time, lat, lon). to_dataframe() produces the same index with air as the only column:

In [17]:
da.to_dataframe()
Out[17]:
                                       air
time                lat  lon
2013-01-01 00:00:00 75.0 200.0  241.199997
                         202.5  242.500000
                         205.0  243.500000
                         207.5  244.000000
                         210.0  244.099991
...                                    ...
2014-12-31 18:00:00 15.0 320.0  297.389984
                         322.5  297.190002
                         325.0  296.489990
                         327.5  296.190002
                         330.0  295.690002

3869000 rows × 1 columns
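
The conversion also works in the other direction. As a minimal sketch, the example below round-trips a small made-up DataArray through a pandas Series using xr.DataArray.from_series; the grid and values are illustrative, not taken from the tutorial data.

```python
import numpy as np
import pandas as pd
import xarray as xr

small_da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": [10.0, 20.0], "lon": [100.0, 110.0, 120.0]},
    name="air",
)

series = small_da.to_series()                # Series with a (lat, lon) MultiIndex
restored = xr.DataArray.from_series(series)  # index levels become dimensions again

print(series.index.names)
print(restored.dims)
print(np.array_equal(restored.values, small_da.values))
```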

Composing a DataArray and Dataset

Say you have only the raw data. How do you compose a DataArray and a Dataset from it?

In [18]:
raw_data = da.data
print(type(raw_data))
print(raw_data.shape)
<class 'numpy.ndarray'>
(2920, 25, 53)
In [21]:
raw_data[0,0,1]
Out[21]:
242.5
In [23]:
# For now, let us not expand each array
xr.set_options(display_expand_data=False)
Out[23]:
<xarray.core.options.set_options at 0x7f9eb1e21d00>
In [24]:
# use DataArray constructor
da2 = xr.DataArray(raw_data, dims=('time','lat','lon'))
da2
Out[24]:
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Dimensions without coordinates: time, lat, lon

The DataArray has no coordinates yet, although the data has three dimensions. You can set the coordinates from another DataArray or from a NumPy array. In this example, lat and lon are evenly spaced, so np.arange can generate them.

In [26]:
lon_array = np.arange(start=200, stop=331, step=2.5)
print(lon_array.shape)
(53,)
In [28]:
da2.coords['lon'] = lon_array
da2
Out[28]:
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
Dimensions without coordinates: time, lat

Similarly, set the latitude coordinate. The time coordinate is set later, on the Dataset.

In [30]:
da2.coords['lat'] = np.arange(start=75, stop=14.9, step=-2.5)
da2
Out[30]:
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time

You can also assign attributes in a similar fashion

In [31]:
da2.attrs['some_attribute'] = 'hello'
da2
Out[31]:
<xarray.DataArray (time: 2920, lat: 25, lon: 53)>
241.2 242.5 243.5 244.0 244.1 243.9 ... 297.9 297.4 297.2 296.5 296.2 295.7
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Attributes:
    some_attribute:  hello
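
The assignments above modify da2 in place. If you would rather leave the original untouched, assign_coords and assign_attrs return a new object and can be chained. A minimal sketch on a small placeholder array:

```python
import numpy as np
import xarray as xr

# Placeholder array with named dimensions but no coordinates or attributes.
bare = xr.DataArray(np.zeros((3, 2)), dims=("lat", "lon"))

labeled = (
    bare
    .assign_coords(lat=[75.0, 72.5, 70.0], lon=[200.0, 202.5])
    .assign_attrs(some_attribute="hello")
)

print(list(labeled.coords))
print(labeled.attrs)
print(list(bare.coords), bare.attrs)  # the original is left unchanged
```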

Composing a Dataset

In [33]:
ds2 = xr.Dataset({'air': da2, 'air2': da2})  # pass a dict-like mapping of names to DataArrays; any number of variables
ds2
Out[33]:
<xarray.Dataset>
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
Dimensions without coordinates: time
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
In [37]:
# lowercase "6h" avoids the deprecation warning that newer pandas raises for "6H"
ds2.coords['time'] = pd.date_range(start='2013-01-01', end="2014-12-31 18:00", freq="6h")
ds2
Out[37]:
<xarray.Dataset>
Dimensions:  (lon: 53, lat: 25, time: 2920)
Coordinates:
  * lon      (lon) float64 200.0 202.5 205.0 207.5 ... 322.5 325.0 327.5 330.0
  * lat      (lat) float64 75.0 72.5 70.0 67.5 65.0 ... 25.0 22.5 20.0 17.5 15.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
    air2     (time, lat, lon) float32 241.2 242.5 243.5 ... 296.5 296.2 295.7
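
The step-by-step construction above can also be collapsed into a single constructor call. The following is a sketch under stated assumptions: the values are placeholder zeros on a reduced 4 x 3 x 2 grid, and the title attribute is made up. Each data variable is given as a (dims, data) tuple, and the coordinates are passed alongside.

```python
import numpy as np
import pandas as pd
import xarray as xr

values = np.zeros((4, 3, 2), dtype="float32")  # placeholder data, not real temperatures

ds3 = xr.Dataset(
    data_vars={
        "air": (("time", "lat", "lon"), values),
        "air2": (("time", "lat", "lon"), values),
    },
    coords={
        "time": pd.date_range(start="2013-01-01", periods=4, freq="6h"),
        "lat": np.arange(start=75, stop=69.9, step=-2.5),
        "lon": np.arange(start=200, stop=202.6, step=2.5),
    },
    attrs={"title": "toy example"},  # hypothetical attribute
)
print(ds3)
```

Passing everything at once lets xarray check that each coordinate's length matches its dimension when the Dataset is created.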