How To Apply A Xarray U_function Over Netcdf And Return A 2d-array (multiple New Variables) To The Dataset

December 26, 2023 Post a Comment

I am trying to use the xarray apply_ufunc to apply a given function f over all pairs of coordinates (i.e. pixels) in the Dataset. The function f returns a 2D array (NxN matrix) as

Solution 1:

It is difficult to get familiar with xarray.apply_ufunc it allows a really wide range of possibilities and it is not always clear how to make the most out of it. In this case, the error is due to input_core_dims and output_core_dims. I'll first extend their docs emphasizing on what I believe has caused the confusion and then provide a couple of solutions. Their docs are:

input_core_dims

List of the same length as args giving the list of core dimensions on each input argument that should not be broadcast. By default, we assume there are no core dimensions on any input arguments.
For example, input_core_dims=[[], ['time']] indicates that all dimensions on the first argument and all dimensions other than ‘time’ on the second argument should be broadcast.
Core dimensions are automatically moved to the last axes of input variables before applying func, which facilitates using NumPy style generalized ufuncs [2].

It takes care of 2 important and related aspects of the computation. First, it defines the dimensions to be broadcast, this is particularly important because the shape of the output is assumed to be the same as the shape defined by these broadcasted dimensions (when this is not the case, output_core_dims must be used). Secondly, the input_core_dims are moved to the end. Below there are two examples:

We can apply a function that does not modify the shape without any extra argument to apply_ufunc:

xr.apply_ufunc(lambda x: x**2, ds)
# Output
<xarray.DataArray (lon: 10, lat: 10, time: 30)>
array([[[6.20066642e+00, 1.68502086e+00, 9.77868899e-01, ...,
         ...,
         2.28979668e+00, 1.76491683e+00, 2.17085164e+00]]])
Coordinates:
  * lon      (lon) int64 50515253545556575859
  * lat      (lat) int64 10111213141516171819
  * time     (time) int64 101112131415161718 ... 3233343536373839

To calculate the mean along lon dimension for instance, we reduce one of the dimensions, therefore, the output will have one dimension less than the input: we must pass lon as an input_core_dim:

xr.apply_ufunc(lambda x: x.mean(axis=-1), ds, input_core_dims=[["lon"]])
# Output
<xarray.DataArray (lat: 10, time: 30)>
array([[ 7.72163214e-01,  3.98689228e-01,  9.36398702e-03,
         ...,
        -3.70034281e-01, -4.57979868e-01,  1.29770762e-01]])
Coordinates:
  * lat      (lat) int64 10111213141516171819
  * time     (time) int64 101112131415161718 ... 3233343536373839

Note that we are doing the mean on axis=-1 even though lon is the first dimension because it will be moved to the end as it is an input_core_dims. We could therefore calculate the mean along lat dim using input_core_dims=[["lon"]].

Note also the format of input_core_dims, it must be a list of lists: List of the same length as args giving the list of core dimensions. A tuple of tuples (or any sequence) is also valid, however, note that with tuples the 1 element case it is (("lon",),) not (("lon")).

output_core_dims

List of the same length as the number of output arguments from func, giving the list of core dimensions on each output that were not broadcast on the inputs. By default, we assume that func outputs exactly one array, with axes corresponding to each broadcast dimension.
Core dimensions are assumed to appear as the last dimensions of each output in the provided order.

Here again, output_core_dims is a list of lists. It must be used when there are multiple outputs (that is, func returns a tuple) or when the output has extra dimensions in addition to the broadcasted dimensions. Obviously, if there are multiple outputs with extra dims, it must be used too. We'll use the two possible solutions as examples.

Solution 1

Use the function posted in the question. This function returns a tuple, therefore we need to use output_core_dims even though the shape of the arrays is not modified. As there are actually no extra dims, we'll pass an empty list per output:

xr.apply_ufunc(
    f,
    ds,
    output_core_dims= [[] for _ in range(4)], 
)

This will return a tuple of DataArrays, its output would be exactly the same as f(ds).

Solution 2

We'll now modify the function to output a single array, stacking all 4 outputs in the tuple. Note that we have to make sure that this new dimension is added at the end of the array:

def f2(x):
    return np.stack((x, x**2, x**3, x**4), axis=-1)

xr.apply_ufunc(
    f2,
    ds,
    output_core_dims= [["predictions"]], 
)
# Output
<xarray.DataArray (lon: 10, lat: 10, time: 30, predictions: 4)>
array([[[[ 2.49011374e+00,  6.20066642e+00,  1.54403646e+01,
           ...,
           4.71259686e+00]]]])
Coordinates:
  * lon      (lon) int64 50515253545556575859
  * lat      (lat) int64 10111213141516171819
  * time     (time) int64 101112131415161718 ... 3233343536373839
Dimensions without coordinates: predictions

We have now passed predictions as output core dim which makes the output have predictions as a new dimension in addition to the original 3. Here the output is not equivalent to f2(ds) (it returns a numpy array) anymore because thanks to using apply_ufunc we have been able to perform several functions and stacking without loosing the labels.

Side note: it is generally not recommended to use mutable objects as defaults arguments in functions: see for example "Least Astonishment" and the Mutable Default Argument

Free Interactive Python Tutorial