Run openEO R UDFs from Python

These details have not been verified by PyPI

Project links

Project description

openeo-udf-python-to-r / openeo-r-udf

This is an experimental engine for openEO to run R UDFs from an R environment.

This currently is limited to R UDFs that are running without any other processes in the following processes:

apply
reduce_dimension

This repository contains the following content:

The script to run for testing is tests/test.py.
The folder tests/udfs contains UDF examples as users could provide them.
udf_lib.py is a Python library with the Python code required to run R UDFs from Python
executor.R is the R script that is run from R and executes the R UDF in the Python environment.
docker/ is the folder containing a docker image.

The following image shows how the implementation roughly works: Workflow

Install from pypi

This is for back-end developers or end-users that want to test their UDFs locally

You can install this library from pypi: pip install openeo-r-udf

You can then import the UDF library from Python: from openeo_r_udf.udf_lib import execute_udf

Afterwards, you can call the UDF library in Python as follows: execute_udf(process, udf, udf_folder, data, dimension, context, parallelize, chunk_size)

The following parameters are available:

process (string - The parent process, i.e. apply or reduce_dimension)
udf (string - The content of the parameter udf from run_udf, i.e. UDF code or a path/URL to a UDF)
udf_folder (string - The folder where the UDFs reside or should be written to)
data (xarray.DataArray - The data to process)
dimension (string, defaults to None - The dimension to work on if applicable, doesn't apply for apply)
context (Any, defaults to None - The data that has been passed in the context parameter)
parallelize (experimental, boolean, defaults to False - Enables or disables parallelization)
chunk_size (experimental, integer, defaults to 1000 - Chunk size for parallelization)

Writing a UDF

This is for end-users

A UDF must be written differently depending on where it is executed. The underlying library used for data handling is always stars.

apply

A UDF that is executed inside the process apply manipulates the values on a per-pixel basis. You can't add or remove labels or dimensions.

The UDF function must be named udf and receives two parameters:

x is a multi-dimensional stars object and you can run vectorized functions on a per pixel basis, e.g. abs.
context passes through the data that has been passed to the context parameter of the parent process (here: apply). If nothing has been provided explicitly, the parameter is set to NULL.

This can be used to pass through configurable options, parameters or some additional data. For example, if you would execute apply(process = run_udf(...), context = list(m = -1, max = -100)) then you could access the corresponding values in the UDF below as context$m and context$max (see example below).

The UDF must return a stars object with exactly the same shape.

Example:

udf = function(x, context) {
  max(abs(x) * context$a, context$max)
}

reduce_dimension

A UDF that is executed inside the process reduce_dimension takes all values along a dimension and computes a single value for it. This could for example compute an average for a timeseries.

There are two different variants of UDFs that can be used as reducers for reduce_dimension. A reducer can be executed either "vectorized" or "per chunk".

vectorized

The vectorized variant is usually the more efficient variant as it's executed once on a larger chunk of the data cube.

The UDF function must be named udf and receives two parameters:

data is a list of lists of values that you can run vectorized functions on a per pixel basis, e.g. pmax.
context -> see the description of context for apply.

The UDF must return a list of values.

Example:

Please note that you may need to use do.call to execute functions in a vectorized way. We also need to use pmax for this, instead of max.

udf = function(data, context) {
  # To get the labels for the values once:
  # labels = names(data)
  do.call(pmax, data) * context
}

The input data may look like this if you reduce along a band dimension with three bands r, g and b:

data could be list(r = c(1, 2, 6), g = c(3, 4, 5), b = c(7, 1, 0))
names(data) would be c("r", "g", "b")
Exeucting the example above would return c(7, 4, 6)

per chunk

This variant is usually slower, but might be required for certain use cases. It is executed multiple times on the smallest chunk possible for the dimension given, e.g. a single time series.

The UDF function must be named udf_chunked and receives two parameters:

data is a list of values, e.g. a single time series which you could pass to max or mean.
context -> see the description of context for apply.

The UDF must return a single value.

Example:

udf_chunked = function(data, context) {
  # To get the labels for the values:
  # labels = names(data)
  max(data)
}

The input data may look like this if you reduce along a band dimension with three bands r, g and b:

data could be c(1, 2, 3)
names(data) would be c("r", "g", "b")
Exeucting the example above would return 3

Setup and Teardown

As udf_chunked is usually executed many times in a row, you can optionally define two additional functions that are executed once before and once after the execution. These functions must be named udf_setup and udf_teardown and be placed in the same file as udf_chunked. udf_setup could be useful to initially load some data, e.g. a machine learning (ML) model. udf_teardown could be used to clean-up stuff that has been opened in udf_setup.

Both functions receive a single parameter, which is the context parameter explained above. Here the context parameter could contain the path to a ML model file, for example. By using the context parameter, you can avoid hard-coding information, which helps to make UDFs reusable.

Example:

udf_setup = function(context) {
  # e.g. load a ML model from a file
}

udf_chunked = function(data, context) {
  max(data)
}

udf_teardown = function(context) {
  # e.g. clean-up tasks
}

Note: udf_teardown is only executed if none of the udf_chunked calls has resulted in an error.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.5.0

Nov 3, 2022

0.4.0

Oct 27, 2022

0.3.0

Oct 12, 2022

0.2.1

Sep 9, 2022

0.2.0

Aug 4, 2022

This version

0.1.1

Jul 13, 2022

0.1.0

Jul 7, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

openeo_r_udf-0.1.1.tar.gz (6.4 kB view details)

Uploaded Jul 13, 2022 Source

Built Distribution

openeo_r_udf-0.1.1-py3-none-any.whl (6.6 kB view details)

Uploaded Jul 13, 2022 Python 3

File details

Details for the file openeo_r_udf-0.1.1.tar.gz.

File metadata

Download URL: openeo_r_udf-0.1.1.tar.gz
Upload date: Jul 13, 2022
Size: 6.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.7.12

File hashes

Hashes for openeo_r_udf-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`ed1c108a826df3ff7f67eb370089b4b428eeb99b10ece92989996ccf24d5c7c9`
MD5	`275398e579ef22a497b3294d4e6e97c4`
BLAKE2b-256	`b94e5185a39ac83e86e2a3fe654b3b5d4e30073119cab97478bb272829d14836`

See more details on using hashes here.

File details

Details for the file openeo_r_udf-0.1.1-py3-none-any.whl.

File metadata

Download URL: openeo_r_udf-0.1.1-py3-none-any.whl
Upload date: Jul 13, 2022
Size: 6.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.7.12

File hashes

Hashes for openeo_r_udf-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6d762029e93c4813558c43157e2155038860064e342d898bc74cab6aa3f24327`
MD5	`0f41f36c41a996656ccf7d29dcad028b`
BLAKE2b-256	`8e41fc8d026295c4631df184a10b258bb6e0f6d22bddba82777294efcfddb9b1`