
On-the-fly operations on geographical maps.


dask-geomodeling

Dask-geomodeling is a collection of classes that can be stacked together to create configurations for on-the-fly operations on geographical maps. By generating Dask compute graphs, these operations can be parallelized and (intermediate) results can be cached.

Multiple Block instances together make a view. Each Block has a get_data method that fetches the data in one go, as well as a get_compute_graph method that creates a graph to compute the data later.

Constructing a view

A dask-geomodeling view can be constructed by creating a Block instance:

from dask_geomodeling.raster import RasterFileSource
source = RasterFileSource('/path/to/geotiff')

The view can now be used to obtain data from the specified file. More complex views can be created by nesting block instances:

from dask_geomodeling.raster import Add, Multiply
add = Add(source, 2.4)
mult = Multiply(source, add)

Obtaining data from a view

Dask-geomodeling revolves around lazy data evaluation. Each Block first evaluates what needs to be done for a certain request, storing that in a compute graph. This graph can then be evaluated to obtain the data. The data is evaluated with dask, and the specification of the compute graph also comes from dask. For more information about how such a graph works, consult the dask documentation.

We use the previous example to demonstrate how this works:

import dask
request = {
    "mode": "vals",
    "bbox": (138000, 480000, 139000, 481000),
    "projection": "epsg:28992",
    "width": 256,
    "height": 256
}
compute_graph, compute_token = add.get_compute_graph(**request)
data = dask.get(compute_graph, compute_token)

Here, we first generate a compute graph using dask-geomodeling, then evaluate the graph using dask. The power of this two-step procedure is twofold:

  1. Dask supports multi-threading, multi-processing, and cluster processing.

  2. The compute_token is a unique identifier of this computation: this can easily be used in caching methods.
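A compute graph like the one returned by get_compute_graph follows dask's plain-dict graph specification: keys map either to data or to task tuples of the form (callable, arg1, arg2, ...). A minimal, self-contained illustration with toy keys (not actual dask-geomodeling output):

```python
import dask

# a dask graph is a plain dict; arguments that are themselves keys
# are resolved recursively before the callable runs
graph = {
    "x": 1,
    "y": 2,
    "add-token": (lambda a, b: a + b, "x", "y"),
}

# the second argument selects which key's result to return;
# compute_token plays this role in the example above
result = dask.get(graph, "add-token")
```

Because the graph is just a dict and the token is just a string, both can be serialized, inspected, or used as a cache key.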

The Block class

To write a new geoblock class, we need to write the following:

  1. the __init__ that validates the arguments when constructing the block

  2. the get_sources_and_requests that processes the request

  3. the process that processes the data

  4. a number of attribute properties such as extent and period

About the 2-step data processing

The get_sources_and_requests method of any block is called recursively from get_compute_graph and feeds the request from the block to its sources. It does so by returning a list of (source, request) tuples. During the data evaluation each of these 2-tuples will be converted to a single data object which is supplied to the process function.

An example in words. We ask the add block from the previous example to do the following:

  • give me a 256x256 raster at location (138000, 480000)

The get_sources_and_requests would respond with the following:

  • I need a 256x256 raster at location (138000, 480000) from RasterFileSource('/path/to/geotiff')

  • I need the number 2.4

The get_compute_graph method works recursively, so it also calls the get_sources_and_requests of the RasterFileSource. The result is a dask compute graph.

When this compute graph is evaluated, the process method of the add geoblock will ultimately receive two arguments:

  • the 256x256 raster from RasterFileSource('/path/to/geotiff')

  • the number 2.4

And the process method produces the end result.
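A hypothetical, stripped-down sketch of this flow, using plain functions instead of a real Block and a fabricated data dict (the real methods live on Block subclasses, as shown in the next section):

```python
import numpy as np

# step 1: pair every source with the request it should receive;
# constants such as 2.4 are paired with None
def get_sources_and_requests(source, value, request):
    return [(source, request), (value, None)]

# step 2: during evaluation each pair has been resolved to a data
# object; process receives them in the same order
def process(data, value):
    return {
        "values": data["values"] + value,
        "no_data_value": data["no_data_value"],
    }

# fake evaluated response of the raster source: a (time, y, x) array
data = {"values": np.zeros((1, 2, 2)), "no_data_value": 255}
result = process(data, 2.4)
```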

Implementation example

As an example, we use a simplified Dilate geoblock, which adds a buffer of 1 pixel around all pixels of given value:

import numpy as np
from scipy import ndimage

from dask_geomodeling.raster import RasterBlock
# expand_request_pixels grows the requested bbox/width/height by a
# pixel margin; its import location is assumed here
from dask_geomodeling.utils import expand_request_pixels

class Dilate(RasterBlock):
    def __init__(self, store, values):
        if not isinstance(store, RasterBlock):
            raise TypeError("'{}' object is not allowed".format(type(store)))
        values = np.asarray(values, dtype=store.dtype)
        super(Dilate, self).__init__(store, values)

    @property
    def store(self):
        return self.args[0]

    @property
    def values(self):
        return self.args[1]

    def get_sources_and_requests(self, **request):
        new_request = expand_request_pixels(request, radius=1)
        return [(self.store, new_request), (self.values, None)]

    @staticmethod
    def process(data, values=None):
        if data is None or values is None or 'values' not in data:
            return data
        original = data['values']
        dilated = original.copy()
        for value in values:
            dilated[ndimage.binary_dilation(original == value)] = value
        dilated = dilated[:, 1:-1, 1:-1]
        return {'values': dilated, 'no_data_value': data['no_data_value']}

    @property
    def extent(self):
        return self.store.extent

    @property
    def period(self):
        return self.store.period

In this example, we see all the essentials of a geoblock implementation.

  • The __init__ checks the types of the provided arguments and calls the super().__init__ that further initializes the geoblock.

  • The get_sources_and_requests expands the request with 1 pixel, so that dilation will have no edge effects. It returns two (source, request) tuples.

  • The process (static)method takes as many arguments as there are (source, request) tuples produced by get_sources_and_requests. It does the actual work and returns a data response.

  • Some attributes like extent and period need manual specification, as they might change through the geoblock.

  • The class derives from RasterBlock, which sets the type of geoblock, and through that its request/response schema and its required attributes.
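The core of the process step can be mimicked in isolation on toy data (a sketch; the trim at the end assumes the 1-pixel request expansion that get_sources_and_requests performs):

```python
import numpy as np
from scipy import ndimage

# 5x5 raster (with a leading time axis) containing one pixel of value 1
original = np.zeros((1, 5, 5), dtype=np.uint8)
original[0, 2, 2] = 1

dilated = original.copy()
for value in np.asarray([1], dtype=np.uint8):
    # grow every pixel equal to `value` by one pixel in each direction
    dilated[ndimage.binary_dilation(original == value)] = value
# trim the 1-pixel padding that the expanded request added
dilated = dilated[:, 1:-1, 1:-1]
```

The single pixel becomes a plus-shaped patch of five pixels in the trimmed 3x3 result, with no edge effects.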

Block types

A block type sets three things:

  1. the response schema: e.g. “RasterBlock.process returns a dictionary with a numpy array and a no data value”

  2. the request schema: e.g. “RasterBlock.get_sources_and_requests expects a dictionary with the fields ‘mode’, ‘bbox’, ‘projection’, ‘height’, ‘width’”

  3. the attributes to be implemented on each geoblock

This is not enforced at the code level; it is up to the developer to stick to this specification. The specification is written down in the type base classes RasterBlock, GeometryBlock, etc.
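Putting the two schemas side by side for RasterBlock, based on the examples in this document (the exact field set is an assumption here):

```python
import numpy as np

# request schema: what get_sources_and_requests receives
request = {
    "mode": "vals",
    "bbox": (138000, 480000, 139000, 481000),
    "projection": "epsg:28992",
    "width": 256,
    "height": 256,
}

# response schema: what process returns, a dict with a (time, y, x)
# array and the value that marks missing data
response = {
    "values": np.full((1, 256, 256), 7, dtype=np.uint8),
    "no_data_value": 255,
}
```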

Local setup (for development)

These instructions assume that git, python3, pip, and virtualenv are installed on your host machine.

First make sure you have the GDAL libraries installed. On Ubuntu:

$ sudo apt install libgdal-dev

Take note of the GDAL version:

$ apt show libgdal-dev

Create and activate a virtualenv:

$ virtualenv --python=python3 .venv
$ source .venv/bin/activate

Install PyGDAL with the correct version (example assumes GDAL 2.2.3):

$ pip install pygdal==2.2.3.*

Install dask-geomodeling:

$ pip install -e .[test]

Run the tests:

$ pytest

Or optionally, with coverage and code style checking:

$ pytest --cov=dask_geomodeling --black

Changelog of dask-geomodeling

2.0.2 (2019-09-04)

  • Clean up the .check() method for RasterBlocks.

  • Added a Travis CI configuration testing against dependency versions since 2017 on Linux and OSX.

  • Took some python 3.5 compatibility measures.

  • Added fix in ParseText block for pandas 0.23.

  • Changed underscores in config to dashes for dask 0.18 compatibility.

  • Constrained dask to >= 0.18, numpy to >= 1.12, pandas to >= 0.19, geopandas to >= 0.4, scipy to >= 0.19.

  • Removed the explicit (py)gdal dependency.

2.0.1 (2019-08-30)

  • Renamed the package to dask-geomodeling.

  • Integrated the settings with dask.config.

  • Added BSD 3-Clause license.

2.0.0 (2019-08-27)

  • Remove raster-store dependency.

  • Removed RasterStoreSource, ThreediResultSource, Result, Interpolate, DeprecatedInterpolate, GeoInterface, and GroupTemporal geoblocks.

  • Removed all django blocks GeoDjangoSource, AddDjangoFields, GeoDjangoSink.

  • Simplified tokenization of Block objects.

  • Implemented construct_multiple to construct multiple blocks at once.

  • Implemented MemorySource and GeoTIFFSource as new raster sources.

  • Add Cumulative geoblock for performing temporal cumulatives.

1.2.13 (2019-08-20)

  • Add TemporalAggregate geoblock for performing temporal aggregates on raster data.

  • Fix raster math geoblocks to not have byte-sized integers ‘wrap around’ when they are added. All integer-types are now at least int32 and all float types at least float32.

1.2.12 (2019-07-30)

  • Made GeoDjangoSource backwards compatible with existing graph definitions.

  • Fix Interpolate wrapper.

1.2.11 (2019-07-19)

  • Added new parameter filters to GeoDjangoSource.

1.2.10 (2019-07-05)

  • Classify block returns a single series with the dtype of the labels if the labels are floats or integers.

1.2.9 (2019-06-29)

  • Fix bug introduced in tokenization fix.

1.2.8 (2019-06-29)

  • Skip tokenization if a block was already tokenized.

1.2.7 (2019-06-28)

  • Implemented AggregateRasterAboveThreshold.

1.2.6 (2019-06-27)

  • Fix in ParseTextColumn for empty column description.

  • Fix empty dataset case in ClassifyFromColumns.

1.2.5 (2019-06-26)

  • Skip (costly) call to tokenize() when constructing without validation. If a graph was supplied that was generated by geoblocks, the token should be present in the name. If the name has incorrect format, a warning is emitted and tokenize() is called after all.

  • Deal with empty datasets in ClassifyFromColumns.

1.2.4 (2019-06-21)

  • Updated ParseTextColumn: allow spaces in values.

1.2.3 (2019-06-21)

  • Rasterize geoblock has a limit of 10000 geometries.

  • Implemented Choose geoblock for Series.

  • Added the block key in the exception message when construction failed.

  • Added caching to get_compute_graph to speedup graph generation.

  • Improved the documentation.

1.2.2 (2019-06-13)

  • Fix tokenization of a geoblock when constructing with validate=False.

  • The raster requests generated in AggregateRaster have their bbox now snapped to (0, 0) for better reproducibility.

1.2.1 (2019-06-12)

  • Fix bug in geoblocks.geometry.constructive.Buffer that was introduced in 1.2.

1.2 (2019-06-12)

  • Extend geometry.field_operations.Classify for classification outside of the bins. For example, you can now supply 2 bins and 3 labels.

  • Implemented geometry.field_operations.ClassifyFromColumns that takes its bins from columns in a GeometryBlock, so that classification can differ per feature.

  • Extend geometry.base.SetSeriesBlock to setting constant values.

  • Implemented geometry.field_operations.Interp.

  • Implemented geometry.text.ParseTextColumn that parses a text column into multiple value columns.

  • AddDjangoFields converts columns to Categorical dtype automatically if the data is of ‘object’ dtype (e.g. strings). This makes the memory footprint of large text fields much smaller.

  • Make validation of a graph optional when constructing.

  • Use dask.get in construct and compute so as not to construct/compute twice.

  • Fix bug in geoblocks.geometry.constructive.Buffer that changed the compute graph inplace, prohibiting 2 computations of the same graph.

1.1 (2019-06-03)

  • GeoDjangoSink returns a dataframe with the ‘saved’ column indicating whether the save succeeded. IntegrityErrors result in saved=False.

  • Added projection argument to GeometryTiler. The GeometryTiler only accepts requests that have a projection equal to the tiling projection.

  • Raise a RuntimeError if the amount of returned geometries by GeoDjangoSource exceeds the GEOMETRY_LIMIT setting.

  • Added auto_pixel_size argument to geometry.AggregateRaster. If this is False, the process raises a RuntimeError when the required raster exceeds the max_size argument.

  • If max_size in the geometry.AggregateRaster is None, it defaults to the global RASTER_LIMIT setting.

  • Remove the index_field_name argument in GeoDjangoSource, instead obtain it automatically from model._meta.pk.name. The index can be added as a normal column by including it in ‘fields’.

  • Change the default behaviour of ‘fields’ in GeoDjangoSource: if not given, no extra fields are included. Also start and end field names are not included.

  • Added the ‘columns’ attribute to all geometry blocks except for the GeometryFileSource.

  • Added tests for SetSeriesBlock and GetSeriesBlock.

  • Added check that column exist in GetSeriesBlock, AddDjangoFields and GeoDjangoSink.

  • Implemented Round geoblock for Series.

  • Fixed AggregateRaster when aggregating in a different projection than the request projection.

  • Allow GeometryTiler to tile in a different projection than the request geometry is using.

1.0 (2019-05-09)

  • Improved GeoDjangoSink docstring + fixed bug.

  • Bug fix in GeoInterface for handling inf values.

  • Added Area Geoblock for area calculation in Geometry blocks.

  • Added MergeGeometryBlocks for merge operation between GeoDataFrames.

  • Added GeometryBlock.__getitem__ and GeometryBlock.set, getting single columns from and setting multiple columns to a GeometryBlock. Corresponding geoblocks are geometry.GetSeriesBlock and geometry.SetSeriesBlock.

  • Added basic operations add, sub, mul, div, truediv, floordiv, mod, eq, neq, ge, gt, le, lt, and, or, xor, and not for SeriesBlocks.

  • Documented the request and response protocol for GeometryBlock.

  • Added a tokenizer for shapely geometries, so that GeometryBlock request hashes are deterministic.

  • Added a tokenizer for datetime and timedelta objects.

  • Added geopandas dependency.

  • Removed GeoJSONSource and implemented GeometryFileSource. This new reader has no simplify and intersect functions.

  • Implemented geometry.set_operations.Intersection.

  • Implemented geometry.constructive.Simplify.

  • Adjusted the MockGeometry test class.

  • Reimplemented utils.rasterize_geoseries and fixed raster.Rasterize.

  • Reimplemented geometry.AggregateRaster.

  • Fixed time requests for 3Di Result geoblocks that are outside the range of the dataset.

  • Implemented geometry.GeoDjangoSource.

  • Implemented geometry.GeoDjangoSink.

  • Added support for overlapping geometries when aggregating.

  • Increased performance of GeoSeries coordinate transformations.

  • Fixed inconsistent naming of the extent-type geometry response.

  • Consistently return an empty geodataframe in case there are no geometries.

  • Implemented geometry.Difference.

  • Implemented geometry.Classify.

  • Implemented percentile statistic for geometry.AggregateRaster.

  • Implemented geometry.GeometryTiler.

  • Explicitly set the result column name for AggregateRaster (default: ‘agg’).

  • Implemented count statistic for geometry.AggregateRaster.

  • Implemented geometry.AddDjangoFields.

  • Added temporal filtering for Django geometry sources.

  • Allow boolean masks in raster.Clip.

  • Implemented raster.IsData.

  • Implemented geometry.Where and geometry.Mask.

  • Extended raster.Rasterize to rasterize float, int and bool properties.

  • Fixed bug in Rasterize that set ‘min_size’ wrong.

0.6 (2019-01-18)

  • Coerce the geo_transform to a list of floats in the raster.Interpolate, preventing TypeErrors in case it consists of decimal.Decimal objects.

0.5 (2019-01-14)

  • Adapted path URLs to absolute paths in RasterStoreSource, GeoJSONSource, and ThreediResultSource. They still accept paths relative to the one stored in settings.

0.4 (2019-01-11)

  • The ‘store_resolution’ result field of GeoInterface now returns the resolution as integer (in milliseconds) and not as datetime.timedelta.

  • Added metadata fields to Optimizer geoblocks.

  • Propagate the union of the geometries in a Group (and Optimizer) block.

  • Propagate the intersection of the geometries in elementwise blocks.

  • Implement the projection metadata field for all blocks.

  • Fixed the Shift geoblock by storing the time shift in milliseconds instead of a datetime.timedelta, which is not JSON-serializable.

0.3 (2018-12-12)

  • Added geoblocks.raster.Classify.

  • Let the raster.Interpolate block accept the (deprecated) layout kwarg.

0.2 (2018-11-20)

  • Renamed ThreediResultSource path property to hdf5_path and fixed it.

0.1 (2018-11-19)

  • Initial project structure created.

  • Copied graphs.py, tokenize.py, wrappers.py, results.py, interfaces.py, and relevant tests and factories from raster-store.

  • Wrappers are renamed into ‘geoblocks’, which are all subclasses of Block. The wrappers were restructured into submodules core, raster, geometry, and interfaces.

  • The new geoblocks.Block baseclass now provides the infrastructure for a) describing a relational block graph and b) generating compute graphs from a request for usage in parallelized computations.

  • Each element in a relational block graph or compute graph is hashed using the tokenize module from dask which is able to generate unique and deterministic tokens (hashes).

  • Blocks are saved to a new json format (version 2).

  • Every block supports the attributes period, timedelta, extent, dtype, fillvalue, geometry, and geo_transform.

  • The check method is implemented on every block and refreshes the primitives (stores.Store / results.Grid).

  • geoblocks.raster.sources.RasterStoreSource should now be wrapped around a raster_store.stores.Store in order to include it as a datasource inside a graph.

  • Reformatted the code using black code formatter.

  • Implemented GroupTemporal as replacement for multi-store Lizard objects.

  • Adapted GeoInterface to mimic now deprecated lizard_nxt.raster.Raster.

  • Fixed an issue with ciso8601 2.*.

  • Bumped raster-store dependency to 4.0.0.

