phenodata

phenodata is an acquisition and processing toolkit for open access phenology data

Project description

https://github.com/earthobservations/phenodata/actions/workflows/tests.yml/badge.svg

https://readthedocs.org/projects/phenodata/badge/

https://codecov.io/gh/earthobservations/phenodata/branch/main/graph/badge.svg

https://static.pepy.tech/badge/phenodata/month

https://img.shields.io/pypi/pyversions/phenodata.svg

phenodata

Phenology data acquisition for humans.

About

Phenodata is an acquisition and processing toolkit for open access phenology data. It is based on pandas, and can be used both as a standalone program, and as a library.

Currently, it implements data wrappers for acquiring phenology observation data published on the DWD Climate Data Center (CDC) FTP server operated by »Deutscher Wetterdienst« (DWD). Adding adapters for other phenology databases and APIs is possible and welcome.

Acknowledgements

Thanks to the many observers of »Deutscher Wetterdienst« (DWD), the »Global Phenological Monitoring programme« (GPM), and all people working behind the scenes for their commitment on recording observations and making the excellent datasets available to the community. You know who you are.

Notes

Please note that phenodata is beta-quality software, and a work in progress. Contributions of all kinds are welcome, in order to make it more solid.

Breaking changes should be expected until a 1.0 release, so version pinning is recommended, especially when you use phenodata as a library.

Synopsis

The easiest way to use phenodata, and to explore the dataset interactively, is to use its command-line interface.

Those two examples will acquire observation data from DWD’s network, only focus on the “beginning of flowering” phase event, and present the results in tabular format using values suitable for human consumption.

Acquire data from DWD’s “immediate” dataset (Sofortmelder).

phenodata observations \
    --source=dwd --dataset=immediate --partition=recent \
    --year=2023 --station=brandenburg \
    --species-preset=mellifera-de-primary \
    --phase="beginning of flowering" \
    --humanize --sort=Datum --format=rst

Acquire data from DWD’s “annual” dataset (Jahresmelder).

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --year="2022,2023" --station=berlin \
    --species-preset=mellifera-de-primary \
    --phase="beginning of flowering" \
    --humanize --sort=Datum --format=rst

Output example

Phenodata can produce output in different formats. This is a table in reStructuredText format.

Datum	Spezies	Phase	Station
2018-02-17	common snowdrop	beginning of flowering	Berlin-Dahlem, Berlin
2018-02-19	common hazel	beginning of flowering	Berlin-Dahlem, Berlin
2018-03-30	goat willow	beginning of flowering	Berlin-Dahlem, Berlin
2018-04-07	dandelion	beginning of flowering	Berlin-Dahlem, Berlin
2018-04-15	cherry (late ripeness)	beginning of flowering	Berlin-Dahlem, Berlin
2018-04-21	winter oilseed rape	beginning of flowering	Berlin-Dahlem, Berlin
2018-04-23	apple (early ripeness)	beginning of flowering	Berlin-Dahlem, Berlin
2018-05-03	apple (late ripeness)	beginning of flowering	Berlin-Dahlem, Berlin
2018-05-24	black locust	beginning of flowering	Berlin-Dahlem, Berlin
2018-08-20	common heather	beginning of flowering	Berlin-Dahlem, Berlin

Usage

Introduction

For most acquisition tasks, you will have to select one of two different datasets of DWD, annual-reporters or immediate-reporters. Further, the data partition has to be selected, it is either recent, or historical.

Currently, as of 2023, the historical datasets extend from the past until 2021. All subsequent observations are stored within the recent dataset partition.

The DWD publishes data in files separated by species, this means each plant’s data will be in a different file. By default, phenodata will acquire data for all species (plants), in order to be able to respond to all kinds of queries across the whole dataset.

If you are only interested in a limited set of species (plants), you can improve data acquisition performance by using the filename option to only select specific files for retrieval.

For example, when using --filename=Hasel,Schneegloeckchen, only file names containing Hasel or Schneegloeckchen will be retrieved, thus minimizing the effort needed to acquire all files.

Install

To install the software from PyPI, invoke:

pip install 'phenodata[sql]' --upgrade

Library use

This snippet demonstrates how to use phenodata as a library within individual programs. For ready-to-run code examples, please have a look into the examples directory.

>>> import pandas as pd
>>> from phenodata.ftp import FTPSession
>>> from phenodata.dwd.cdc import DwdCdcClient
>>> from phenodata.dwd.pheno import DwdPhenoData

>>> cdc_client = DwdCdcClient(ftp=FTPSession())
>>> client = DwdPhenoData(cdc=cdc_client, humanizer=None, dataset="immediate")
>>> options = {
...     # Select data partition.
...     "partition": "recent",
...
...     # Filter by file names and years.
...     "filename": ["Hasel", "Raps", "Mais"],
...     "year": [2018, 2019, 2020],
...
...     # Filter by station identifier.
...     "station-id": [13346]
... }

>>> observations: pd.DataFrame = client.get_observations(options, humanize=False)
>>> observations.info()
[...]
>>> observations
[...]

Command-line use

This section gives you an idea about how to use the phenodata program on the command-line.

$ phenodata --help

Usage:
  phenodata info
  phenodata list-species --source=dwd [--format=csv]
  phenodata list-phases --source=dwd [--format=csv]
  phenodata list-stations --source=dwd --dataset=immediate [--all] [--filter=berlin] [--sort=Stationsname] [--format=csv]
  phenodata nearest-station --source=dwd --dataset=immediate --latitude=52.520007 --longitude=13.404954 [--format=csv]
  phenodata nearest-stations --source=dwd --dataset=immediate --latitude=52.520007 --longitude=13.404954 [--all] [--limit=10] [--format=csv]
  phenodata list-quality-levels --source=dwd [--format=csv]
  phenodata list-quality-bytes --source=dwd [--format=csv]
  phenodata list-filenames --source=dwd --dataset=immediate --partition=recent [--filename=Hasel,Schneegloeckchen] [--year=2017]
  phenodata list-urls --source=dwd --dataset=immediate --partition=recent [--filename=Hasel,Schneegloeckchen] [--year=2017]
  phenodata (observations|forecast) --source=dwd --dataset=immediate --partition=recent [--filename=Hasel,Schneegloeckchen] [--station-id=164,717] [--species-id=113,127] [--phase-id=5] [--quality-level=10] [--quality-byte=1,2,3] [--station=berlin,brandenburg] [--species=hazel,snowdrop] [--species-preset=mellifera-de-primary] [--phase=flowering] [--quality=ROUTKLI] [--year=2017] [--forecast-year=2021] [--humanize] [--show-ids] [--language=german] [--long-station] [--sort=Datum] [--sql=sql] [--format=csv] [--verbose]
  phenodata drop-cache --source=dwd
  phenodata --version
  phenodata (-h | --help)

Data acquisition options:
  --source=<source>         Data source. Currently, only "dwd" is a valid identifier.
  --dataset=<dataset>       Data set. Use "immediate" or "annual" for "--source=dwd".
  --partition=<dataset>     Partition. Use "recent" or "historical" for "--source=dwd".
  --filename=<file>         Filter by file names (comma-separated list)

Direct filtering options:
  --year=<year>             Filter by year (comma-separated list)
  --station-id=<station-id> Filter by station identifiers (comma-separated list)
  --species-id=<species-id> Filter by species identifiers (comma-separated list)
  --phase-id=<phase-id>     Filter by phase identifiers (comma-separated list)

Humanized filtering options:
  --station=<station>       Filter by strings from "stations" data (comma-separated list)
  --species=<species>       Filter by strings from "species" data (comma-separated list)
  --phase=<phase>           Filter by strings from "phases" data (comma-separated list)
  --species-preset=<preset> Filter by strings from "species" data (comma-separated list)
                            The preset will get loaded from the "presets.json" file.

Forecasting options:
  --forecast-year=<year>    Use as designated forecast year.

Postprocess filtering options:
  --sql=<sql>               Apply given SQL query before output.

Data output options:
  --format=<format>         Output data in designated format. Choose one of "tabular", "json",
                            "csv", or "string". Use "md" for Markdown output, or "rst" for
                            reStructuredText. With "tabular:foo", it is also possible to specify
                            other tabular output formats.  [default: tabular:psql]
  --sort=<sort>             Sort by given field names. (comma-separated list)
  --humanize                Resolve identifier-based fields to human-readable labels.
  --show-ids                Show identifiers alongside resolved labels, when using "--humanize".
  --language=<language>     Use labels in designated language, when using "--humanize"
                            [default: english].
  --long-station            Use long station name including "Naturraumgruppe" and "Naturraum".
  --limit=<limit>           Limit output of "nearest-stations" to designated number of entries.
                            [default: 10]
  --verbose                 Turn on verbose output.

Examples

The best way to explore phenodata is by running a few example invocations.

The “Metadata” section will walk you through different commands which can be used to inquire information about monitoring stations/sites, and to list the actual files which will be acquired, in order to learn about data lineage.
The “Observations” section will demonstrate command examples to acquire, process, and format actual observation data.

Metadata

Display list of species, with their German, English, and Latin names:

phenodata list-species --source=dwd

Display list of phases, with their German and English names:

phenodata list-phases --source=dwd

List of all reporting/monitoring stations:

phenodata list-stations --source=dwd --dataset=immediate

List of stations, with filtering:

phenodata list-stations --source=dwd --dataset=annual --filter="Fränkische Alb"

Display nearest station for given position:

phenodata nearest-station --source=dwd --dataset=immediate \
    --latitude=52.520007 --longitude=13.404954

Display 20 nearest stations for given position:

phenodata nearest-stations \
    --source=dwd --dataset=immediate \
    --latitude=52.520007 --longitude=13.404954 --limit=20

List of file names of recent observations by the annual reporters:

phenodata list-filenames \
    --source=dwd --dataset=annual --partition=recent

Same as above, but with filtering by file name:

phenodata list-filenames \
    --source=dwd --dataset=annual --partition=recent \
    --filename=Hasel,Kornelkirsche,Loewenzahn,Schneegloeckchen

List full URLs instead of only file names:

phenodata list-urls \
    --source=dwd --dataset=annual --partition=recent \
    --filename=Hasel,Kornelkirsche,Loewenzahn,Schneegloeckchen

Observations

Basic

Observations of hazel and snowdrop, using filename-based filtering at data acquisition time:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --filename=Hasel,Schneegloeckchen

Observations of hazel and snowdrop (dito), but for specific station identifiers:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --filename=Hasel,Schneegloeckchen --station-id=7521,7532

All observations for specific station identifiers and specific years:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --station-id=7521,7532 --year=2020,2021

All observations for specific station and species identifiers:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --station-id=7521,7532 --species-id=113,127

All observations marked as invalid:

phenodata list-quality-bytes --source=dwd
phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --quality-byte=5,6,7,8

Humanized output

The option --humanize will improve textual output by resolving identifier fields to appropriate human-readable text labels.

Observations for species “hazel”, “snowdrop”, “apple” and “pear” at station “Berlin-Dahlem”, output texts in the German language, if possible:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --filename=Hasel,Schneegloeckchen,Apfel,Birne \
    --station-id=12132 \
    --humanize \
    --language=german

Humanized search

When using the --humanize option, you can use the non-identifier-based filtering options --station, --species, and --phase, to use human-readable text labels for filtering instead of numeric identifiers.

Query observations by using real-world location names:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --filename=Hasel,Schneegloeckchen \
    --station=berlin,brandenburg \
    --humanize --sort=Datum

Query observations near Munich with species names “hazel” and “snowdrop” in specific year:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --station=münchen \
    --species=hazel,snowdrop \
    --year=2022 \
    --humanize --sort=Datum

Now, let’s query for any “flowering” observations. There will be beginning of flowering, general flowering, and end of flowering:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --station=münchen \
    --phase=flowering \
    --year=2022 \
    --humanize --sort=Datum

Same observations as before but with ROUTKLI quality marker:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --station=münchen \
    --phase=flowering \
    --quality="nicht beanstandet" \
    --year=2022 \
    --humanize --sort=Datum

Now, let’s inquire those field values which have seen corrections instead (Feldwert korrigiert):

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --station=münchen \
    --phase=flowering \
    --quality=korrigiert \
    --humanize --sort=Datum

Filtering with presets

When using the --humanize option, you can also use pre-defined shortcuts for lists of species by name. For example, the mellifera-de-primary preset is defined within the presets.json file like:

Hasel, Schneeglöckchen, Sal-Weide, Löwenzahn, Süßkirsche, Apfel, Winterraps, Robinie, Winter-Linde, Heidekraut

Then, you can use the option --species-preset=mellifera-de-primary instead of the --species option for filtering only those specified species.

This example lists all “beginning of flowering” observations for the specified years in Köln, only for the named list of species mellifera-de-primary. The result will be sorted by species and date, and human-readable labels will be displayed in German, when possible:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --phase="beginning of flowering" \
    --year=2021,2022,2023 \
    --station=köln \
    --species-preset=mellifera-de-primary \
    --humanize --language=german --sort=Spezies,Datum

Filtering with SQL

Phenodata uses the DuckDB Python API to let you directly query the pandas DataFrame produced by the data acquisition subsystem. This example uses an SQL statement to filter the results by station name, and sort them by date:

phenodata observations \
    --source=dwd --dataset=annual --partition=recent \
    --year=2019,2020,2021,2022,2023 \
    --species-preset=mellifera-de-primary --phase="beginning of flowering" \
    --humanize --language=german \
    --sql="SELECT * FROM data WHERE Station LIKE '%Berlin%' ORDER BY Datum" \
    --format=md

Project information

Resources

Contributions

If you would like to contribute, you are most welcome. Spend some time taking a look around, locate a bug, design issue or spelling mistake and then send us a pull request or create an issue. Thank you in advance for your efforts, the authors really appreciate any kind of help and feedback.

Discussions

Discussions around the development of phenodata and its applications are taking place at the Hiveeyes forum. Enjoy reading them, and don’t hesitate to write in, if you think you may be able to contribute a thing or another, or to share what you have been doing with the program in form of a “show and tell” post.

Development

In order to setup a development environment on your workstation, please head over to the development sandbox documentation. When you see the software tests succeed, you should be ready to start hacking.

Code license

The project is licensed under the terms of the GNU AGPL license, see LICENSE.

Data license

The DWD has information about their data re-use policy in German and English. Please refer to the respective Disclaimer (de, en) and Copyright (de, en) information.

Disclaimer

The project and its authors are not affiliated with DWD, GPM, USA-NPN, or any other organization in any way. It is a sole project conceived by the community, in order to make data more accessible, in the spirit of open data and open scientific data. The authors believe the world would be a better place if public data could be loaded into pandas dataframes and Xarray datasets easily.

Project details

Release history Release notifications | RSS feed

0.13.1

Apr 11, 2023

This version

0.13.0

Apr 11, 2023

0.12.0

Apr 11, 2023

0.11.0

Dec 29, 2020

0.10.0

Dec 29, 2020

0.9.4

Dec 29, 2020

0.9.3

Dec 28, 2020

0.9.2

Dec 28, 2020

0.9.1

Jan 7, 2020

0.9.0

Jan 7, 2020

0.8.0

Mar 29, 2018

0.7.0

Mar 29, 2018

0.6.5

Mar 28, 2018

0.6.4

Mar 15, 2018

0.6.3

Mar 14, 2018

0.6.2

Mar 14, 2018

0.6.1

Mar 14, 2018

0.6.0

Mar 14, 2018

0.5.0

Mar 14, 2018

0.4.0

Mar 14, 2018

0.3.0

Mar 13, 2018

0.2.0

Mar 12, 2018

0.1.0

Mar 12, 2018

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

phenodata-0.13.0.tar.gz (47.0 kB view hashes)

Uploaded Apr 11, 2023 Source

Built Distribution

phenodata-0.13.0-py3-none-any.whl (43.0 kB view hashes)

Uploaded Apr 11, 2023 Python 3

Hashes for phenodata-0.13.0.tar.gz

Hashes for phenodata-0.13.0.tar.gz
Algorithm	Hash digest
SHA256	`344d6f786538d1bfe3547a0358b225c8b5627ad2696d60a62d76ff98246891c8`
MD5	`ef83ddc4a3bd19dd79222d6efd582f7c`
BLAKE2b-256	`a2e80b17a64d6cee0b7f2304e75d71edbf72588479e0dce47e271fd1d68cbb08`

Hashes for phenodata-0.13.0-py3-none-any.whl

Hashes for phenodata-0.13.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4e978d44b5efecd1093c55355ade1b91b3cc043f2172e54b84291a87ab91e3af`
MD5	`0407a4c9cd9d0a499e4397dca7b9457a`
BLAKE2b-256	`db416bdc8749e18c96d97a9fe472a0b057993ba008bba94c9ce391c08c3862a2`

phenodata 0.13.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

phenodata

About

Acknowledgements

Notes

Synopsis

Output example

Usage

Introduction

Install

Library use

Command-line use

Examples

Metadata

Observations

Basic

Humanized output

Humanized search

Filtering with presets

Filtering with SQL

Project information

Resources

Contributions

Discussions

Development

Code license

Data license

Disclaimer

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution