Monday, August 11th, 2025 (2 days ago)
TLDR: Xarray>2025.6 has been through a major refactoring of its internals that makes coordinate-based data selection and alignment more customizable, via built-in and/or 3rd party indexes! In this post we highlight a few examples that take advantage of this new superpower. If you're interested in this topic, also check out Deepak Cherian's 2025 SciPy Presentation and Slides!
First thing's first, what is an index
and why is it helpful?
An index makes repeated subsetting and selection of data more efficient.
Examples of indexes are all around you and are a fundamental way to organize and simplify access to information.
If you want a book about Natural Sciences, you can go to your local library branch and head straight to section 500
.
Or if you're in the mood for a good novel go to section 800
thanks to the Dewey Decimal System (credit to Dewey, 1876)!
Some indexes are less universal and more multi-dimensional: In my local grocery store I know that aisle 12, top shelf has the best cereal. And the second shelf on aisle 1 has the yogurt. In this example, aisles 1-12 and shelves 1-5 are the coordinates of our grocery, but the more infomative aisle content labels are the indexes. Once you've mentally asigned labels to your grocery, you can get what you want quickly without needing to wander around!
The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurements Y=[10,20,30,40,50,60]
at six positions X=[1, 2, 4, 8, 16, 32]
. What was our measurement at X=8
?
To extract the answer in code we can loop over all the values of X
to find X=8
. In Python conventions we find it at position 3, then use that to get our answer Y[3]=40
.
💡 Note:
With only 6 coordinates, we easily see X[3]=8
, but for large datasets we should loop over all the coordinates to ensure there are no repeated values!
This initial pass over all the coordinates to build an index takes some time and may not always be desireable.
In the example above the index is simply a key:value mapping between the coordinate values and integer positions i=[0,1,2,3,4,5]
in the coordinates array.
This in fact is the default Pandas.Index! And this is what Xarray has exclusively used until now to power label-based selection:
1# NOTE Xarray builds a "PandasIndex" for coordinate 'x' by default 2x = np.array([1, 2, 4, 8, 16, 32]) 3y = np.array([10, 20, 30, 40, 50, 60]) 4da = xr.DataArray(y, coords={'x': x}) 5da 6
1da.sel(x=3) 2# 40 3
Importantly, a loop over all the coordinate values is not the only way to create an index.
You might recognize that our coordinates about can in fact be represented by a function X(i)=2**i
where i
is the integer position! Given that information we can quickly get measurement values at any poisition: Y(X=4)
= Y[sqrt(4)]
= Y[2]=30
.
Often, coordinates are even simplier and can be definied by a start, stop, and uniform step size.
For this common case, Xarray added a built-in RangeIndex
that bypasses Pandas.
Note that coordinates are now calculated on-the-fly rather than loaded into memory up-front when creating a Dataset.
1from xarray.indexes import RangeIndex 2 3index = RangeIndex.arange(0.0, 100_000, 0.1, dim='x') 4ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index)) 5ds 6
A lot of work over the last several years has gone into the nuts and bolts of Xarray to make it possible to plug in new Indexes. Here we'll highlight a few examples!
real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc. Xarray Docs
We often think about metadata providing context for measurement values but metadata is also critical for coordinates! In particular, to align two different datasets we must ask if the coordinates are in the same coordinate system. In other words, do they share the same origin and scale?
There are currently over 7000 commonly used Coordinate Reference Systems (CRS) for geospatial data in the authoritative EPSG database!
And of course an infinite number of custom-defined CRSs.
xproj.CRSIndex gives Xarray objects an automatic awareness of the coordinate reference system operations like xr.align()
no longer succeed when they should raise an error:
1from xproj import CRSIndex 2lons1 = np.arange(-125, -120, 1) 3lons2 = np.arange(-122, -118, 1) 4ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267) 5ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326) 6ds1 + ds2 7
1MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs' 2
Earlier we mentioned that coordinates often have a functional representation. For 2D geospatial raster images, this function often takes the form of an Affine Transform. This how the rasterix RasterIndex computes coordinates rather than storing them all in memory. Also alignment by comparing transforms minimizes common errors due to floating point mismatches.
Below is a simple example of slicing a large mosaic of GeoTiffs without ever loading the coordiantes into memory, note that a new Affine is defined after the slicing operation:
1# 811816322401 values! 2import rasterix 3 4#26475 GeoTiffs represented by a GDAL VRT 5da = xr.open_dataarray('https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt', 6 engine='rasterio', 7 parse_coordinates=False).squeeze().pipe( 8 rasterix.assign_index 9) 10da 11
1print('Original geotransform:\n', da.xindexes['x'].transform()) 2da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1,-49.0)) 3print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform()) 4
Original geotransform: | 0.00, 0.00,-180.00| | 0.00,-0.00, 84.00| | 0.00, 0.00, 1.00| Sliced geotransform: | 0.00, 0.00,-122.40| | 0.00,-0.00,-47.10| | 0.00, 0.00, 1.00|
A "vector data cube" is an n-D array that has at least one dimension indexed by a 2-D array of vector geometries.
Large vector cubes can take advantage of an R-tree spatial index for efficiently selecting vector geometries within a given bounding box.
The XVec.GeometryIndex
provides this functionality, below is a short code snippet but please refer to the documentation for more!
1import xvec 2import geopandas as gpd 3from geodatasets import get_path 4 5# Dataset that contains demographic data indexed by U.S. counties 6counties = gpd.read_file(get_path("geoda.natregimes")) 7 8cube = xr.Dataset( 9 data_vars=dict( 10 population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]), 11 unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]), 12 ), 13 coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]), 14).xvec.set_geom_indexes("county", crs=counties.crs) 15cube 16
1# Efficient selection using shapely.STRtree 2from shapely.geometry import box 3 4subset = cube.xvec.query( 5 "county", 6 box(-125.4, 40, -120.0, 50), 7 predicate="intersects", 8) 9 10subset['population'].xvec.plot(col='year'); 11
While we're extremely excited about what can already be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work.
Have an idea for your own custom index? Check out this section of the Xarray documentation. Also check out the A Gallery of Custom Index Examples!
There are a few new indexes that will soon become part of the Xarray codebase!
This work would not have been possible without technical input from the Xarray core team and community! Several developers received essential funding from a CZI Essential Open Source Software for Science (EOSS) grant as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345.