Functions to make reference descriptions for ReferenceFileSystem

fsspec-reference-maker

Version 0

Prototype spec for the structure required by ReferenceFileSystem:

{
  "key0": "data",
  "key1": ["protocol://target_url", 10000, 100]
}

where:

  • key0 includes data as-is (stored as text)
  • key1 refers to a data file URL, the offset within the file (in bytes), and the length of the data item (in bytes).
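As a rough illustration of how a reader of this structure might resolve a key, the sketch below handles both forms. The function name `read_reference` is hypothetical, and it assumes the target is a local file path rather than a `protocol://` URL; it is not the ReferenceFileSystem implementation.

```python
import os
import tempfile


def read_reference(refs, key):
    """Return the bytes for `key` from a Version 0 reference mapping.

    Hypothetical helper for illustration only.
    """
    value = refs[key]
    if isinstance(value, str):
        # Inline data: stored as text, returned as bytes.
        return value.encode("ascii")
    # Otherwise the value is [url, offset, length].
    url, offset, length = value
    # Assumption: `url` is a local path; a real reader would dispatch
    # on the protocol part of the URL.
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Build a small target file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    path = tmp.name

refs = {"key0": "data", "key1": [path, 10, 4]}
inline = read_reference(refs, "key0")  # inline text as bytes
chunk = read_reference(refs, "key1")   # 4 bytes starting at offset 10
os.remove(path)
```

Here `inline` holds the literal text as bytes, while `chunk` holds only the requested slice of the target file.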

For example, Zarr data in this proposed spec might be represented as:

{
  ".zgroup": "{\n    \"zarr_format\": 2\n}",
  ".zattrs": "{\n    \"Conventions\": \"UGRID-0.9.0\"\n}",
  "x/.zattrs": "{\n    \"_ARRAY_DIMENSIONS\": [\n        \"node\"\n ...",
  "x/.zarray": "{\n    \"chunks\": [\n        9228245\n    ],\n    \"compressor\": null,\n    \"dtype\": \"<f8\",\n  ...",
  "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960]
}
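To make the example concrete, the following sketch parses one inline metadata entry and checks the byte range of the chunk reference. It reuses only values shown above; the figure of 8 bytes per element follows from the `<f8` dtype with no compressor, and the variable names are illustrative.

```python
import json

# Two entries copied from the example above.
refs = {
    ".zgroup": "{\n    \"zarr_format\": 2\n}",
    "x/0": ["s3://bucket/path/file.nc", 294094376, 73825960],
}

# Inline values are JSON text, so they can be parsed directly.
zgroup = json.loads(refs[".zgroup"])

# The chunk reference is [url, offset, length].
url, offset, length = refs["x/0"]
end = offset + length  # first byte after the chunk

# An uncompressed chunk of 9228245 float64 values occupies
# 9228245 * 8 bytes, which should match the stored length.
chunk_bytes = 9228245 * 8
```

The stored length matching `chunk_bytes` is what one would expect for an uncompressed chunk read straight out of the original file.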

Version 1

The metadata structure is expressed in JSON. For possible future binary storage, note that "version", "gen" and "templates" should be considered attributes, while "refs" is the data that ought to dominate the storage size. The previous definition, Version 0, is compatible with the "refs" entry; Version 1 adds features on top of it. It will also be possible to expand this enhanced spec into Version 0 format.

{
  "version": (required, must be equal to) 1,
  "templates": (optional, zero or more arbitrary keys) {
    "template_name": jinja-str
  },
  "gen": (optional, zero or more items) [
    {
      "key": (required) jinja-str,
      "url": (required) jinja-str,
      "offset": (optional, required with "length") jinja-str,
      "length": (optional, required with "offset") jinja-str,
      "dimensions": (required, one or more arbitrary keys) {
        "variable_name": (required)
          {"start": (optional) int, "stop": (required) int, "step": (optional) int}
          OR
          [int, ...]
      }
    }
  ],
  "refs": (optional, zero or more arbitrary keys) {
    "key_name": (required) str OR [url(jinja-str)] OR [url(jinja-str), offset(int), length(int)]
  }
}

Where:

  • jinja-str is a string which will be rendered by jinja2 or its non-python equivalent; i.e., it may be a literal string, or may include "{{..}}" annotations, where
    • for the values associated with a template_name, the variables are to be passed in reference URL strings that use this template
    • for the values within a "gen" object, variables come from the "dimensions" and "templates"
  • the str format of a reference value may be
    • a string starting "base64:", which will be decoded to binary
    • any other string, interpreted as ASCII data
  • a ref value given as a str holds inline data, a one-element array refers to a whole URL, and a three-element array refers to a byte range within a URL
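The three forms of a "refs" value can be told apart with a small dispatch, sketched below. The function name `classify_ref` and the tuple tags it returns are hypothetical and chosen only for illustration; template rendering of the URL is left out.

```python
import base64


def classify_ref(value):
    """Classify one Version 1 "refs" value (hypothetical helper)."""
    if isinstance(value, str):
        if value.startswith("base64:"):
            # Binary data carried inline, base64-encoded after the prefix.
            return ("data", base64.b64decode(value[len("base64:"):]))
        # Any other string is inline ASCII data.
        return ("data", value.encode("ascii"))
    if len(value) == 1:
        # One-element array: the whole target URL.
        return ("whole_url", value[0])
    if len(value) == 3:
        # Three-element array: a byte range within the target URL.
        url, offset, length = value
        return ("byte_range", url, offset, length)
    raise ValueError(f"unrecognised reference value: {value!r}")


text_ref = classify_ref("data")
binary_ref = classify_ref("base64:AAEC")
whole_ref = classify_ref(["http://target_url"])
range_ref = classify_ref(["http://target_url", 10000, 100])
```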

Here is an example:

{
    "version": 1,
    "templates": {
        "u": "server.domain/path",
        "f": "{{c}}"
    },
    "gen": [
        {
            "key": "gen_key{{i}}",
            "url": "http://{{u}}_{{i}}",
            "offset": "{{(i + 1) * 1000}}",
            "length": "1000",
            "dimensions": 
              {
                "i": {"stop":  5}
              }
        }   
    ],
    "refs": {
      "key0": "data",
      "key1": ["http://target_url", 10000, 100],
      "key2": ["http://{{u}}", 10000, 100],
      "key3": ["http://{{f(c='text')}}", 10000, 100]
    }
}

Here the variable i takes the values [0, 1, 2, 3, 4], which could equally have been provided in array form. Where there is more than one variable, the Cartesian product of their values is formed.
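The expansion of "dimensions" into variable assignments might be sketched as follows. The function name `expand_dimensions` is hypothetical. The default of 0 for "start" matches the example above, while the default of 1 for "step" is an assumption that mirrors Python's `range`.

```python
import itertools


def expand_dimensions(dimensions):
    """Return one dict of variable values per generated reference.

    Hypothetical helper; defaults for "start" and "step" are assumed.
    """
    names = list(dimensions)
    axes = []
    for name in names:
        spec = dimensions[name]
        if isinstance(spec, dict):
            # Range form: {"start": ..., "stop": ..., "step": ...}
            axes.append(
                range(spec.get("start", 0), spec["stop"], spec.get("step", 1))
            )
        else:
            # Array form: explicit list of integers.
            axes.append(spec)
    # Cartesian product over all variables, in declaration order.
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]


single = expand_dimensions({"i": {"stop": 5}})
pairs = expand_dimensions({"i": {"stop": 2}, "j": [10, 20]})
```

With two variables, `pairs` contains every combination of `i` and `j`, so two values of each yield four assignments.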

This example evaluates to the Version 0 equivalent:

{
  "key0": "data",
  "key1": ["http://target_url", 10000, 100],
  "key2": ["http://server.domain/path", 10000, 100],
  "key3": ["http://text", 10000, 100],
  "gen_key0": ["http://server.domain/path_0", 1000, 1000],
  "gen_key1": ["http://server.domain/path_1", 2000, 1000],
  "gen_key2": ["http://server.domain/path_2", 3000, 1000],
  "gen_key3": ["http://server.domain/path_3", 4000, 1000],
  "gen_key4": ["http://server.domain/path_4", 5000, 1000]
}

such that accessing, for instance, "key0" returns b"data" and accessing "gen_key0" returns 1000 bytes from the given URL, at an offset of 1000.
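The generated entries can be checked by hand. The sketch below translates the "gen" expressions from the example into Python f-strings instead of rendering them through jinja2, so it illustrates the arithmetic only and is not how a real expander would be written.

```python
# Value of the "u" template in the example.
u = "server.domain/path"

# Hand-translated equivalents of the "gen" entry:
#   key    -> "gen_key{{i}}"
#   url    -> "http://{{u}}_{{i}}"
#   offset -> "{{(i + 1) * 1000}}"
#   length -> "1000"
generated = {
    f"gen_key{i}": [f"http://{u}_{i}", (i + 1) * 1000, 1000]
    for i in range(5)  # "dimensions": {"i": {"stop": 5}}
}

# Byte range covered by the first generated reference.
first_url, first_offset, first_length = generated["gen_key0"]
```

The offsets step by 1000 bytes from one key to the next, so the five references cover consecutive, non-overlapping 1000-byte windows, each in its own target URL.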

Examples

A notebook example comparing reading HDF5 using this approach vs. native Zarr format can be run on Binder.
