How To Apply A Xarray U_function Over Netcdf And Return A 2d-array (multiple New Variables) To The Dataset
Solution 1:
It is difficult to get familiar with xarray.apply_ufunc
it allows a really wide range of possibilities and it is not always clear how to make the most out of it. In this case, the error is due to input_core_dims
and output_core_dims
. I'll first extend their docs emphasizing on what I believe has caused the confusion and then provide a couple of solutions. Their docs are:
input_core_dims
List of the same length as args giving the list of core dimensions on each input argument that should not be broadcast. By default, we assume there are no core dimensions on any input arguments.
For example, input_core_dims=[[], ['time']] indicates that all dimensions on the first argument and all dimensions other than ‘time’ on the second argument should be broadcast.
Core dimensions are automatically moved to the last axes of input variables before applying func, which facilitates using NumPy style generalized ufuncs [2].
It takes care of 2 important and related aspects of the computation. First, it defines the dimensions to be broadcast, this is particularly important because the shape of the output is assumed to be the same as the shape defined by these broadcasted dimensions (when this is not the case, output_core_dims
must be used). Secondly, the input_core_dims
are moved to the end. Below there are two examples:
We can apply a function that does not modify the shape without any extra argument to apply_ufunc
:
xr.apply_ufunc(lambda x: x**2, ds)
# Output
<xarray.DataArray (lon: 10, lat: 10, time: 30)>
array([[[6.20066642e+00, 1.68502086e+00, 9.77868899e-01, ...,
...,
2.28979668e+00, 1.76491683e+00, 2.17085164e+00]]])
Coordinates:
* lon (lon) int64 50515253545556575859
* lat (lat) int64 10111213141516171819
* time (time) int64 101112131415161718 ... 3233343536373839
To calculate the mean along lon
dimension for instance, we reduce one of the dimensions, therefore, the output will have one dimension less than the input: we must pass lon
as an input_core_dim
:
xr.apply_ufunc(lambda x: x.mean(axis=-1), ds, input_core_dims=[["lon"]])
# Output
<xarray.DataArray (lat: 10, time: 30)>
array([[ 7.72163214e-01, 3.98689228e-01, 9.36398702e-03,
...,
-3.70034281e-01, -4.57979868e-01, 1.29770762e-01]])
Coordinates:
* lat (lat) int64 10111213141516171819
* time (time) int64 101112131415161718 ... 3233343536373839
Note that we are doing the mean on axis=-1
even though lon
is the first dimension because it will be moved to the end as it is an input_core_dims
. We could therefore calculate the mean along lat
dim using input_core_dims=[["lon"]]
.
Note also the format of input_core_dims
, it must be a list of lists: List of the same length as args giving the list of core dimensions. A tuple of tuples (or any sequence) is also valid, however, note that with tuples the 1 element case it is (("lon",),)
not (("lon"))
.
output_core_dims
List of the same length as the number of output arguments from func, giving the list of core dimensions on each output that were not broadcast on the inputs. By default, we assume that func outputs exactly one array, with axes corresponding to each broadcast dimension.
Core dimensions are assumed to appear as the last dimensions of each output in the provided order.
Here again, output_core_dims
is a list of lists. It must be used when there are multiple outputs (that is, func returns a tuple) or when the output has extra dimensions in addition to the broadcasted dimensions. Obviously, if there are multiple outputs with extra dims, it must be used too. We'll use the two possible solutions as examples.
Solution 1
Use the function posted in the question. This function returns a tuple, therefore we need to use output_core_dims
even though the shape of the arrays is not modified. As there are actually no extra dims, we'll pass an empty list per output:
xr.apply_ufunc(
f,
ds,
output_core_dims= [[] for _ in range(4)],
)
This will return a tuple of DataArrays, its output would be exactly the same as f(ds)
.
Solution 2
We'll now modify the function to output a single array, stacking all 4 outputs in the tuple. Note that we have to make sure that this new dimension is added at the end of the array:
def f2(x):
return np.stack((x, x**2, x**3, x**4), axis=-1)
xr.apply_ufunc(
f2,
ds,
output_core_dims= [["predictions"]],
)
# Output
<xarray.DataArray (lon: 10, lat: 10, time: 30, predictions: 4)>
array([[[[ 2.49011374e+00, 6.20066642e+00, 1.54403646e+01,
...,
4.71259686e+00]]]])
Coordinates:
* lon (lon) int64 50515253545556575859
* lat (lat) int64 10111213141516171819
* time (time) int64 101112131415161718 ... 3233343536373839
Dimensions without coordinates: predictions
We have now passed predictions
as output core dim which makes the output have predictions
as a new dimension in addition to the original 3. Here the output is not equivalent to f2(ds)
(it returns a numpy array) anymore because thanks to using apply_ufunc
we have been able to perform several functions and stacking without loosing the labels.
Side note: it is generally not recommended to use mutable objects as defaults arguments in functions: see for example "Least Astonishment" and the Mutable Default Argument
Post a Comment for "How To Apply A Xarray U_function Over Netcdf And Return A 2d-array (multiple New Variables) To The Dataset"