Out-of-core NumPy arrays

Project description

Wendelin.core allows you to work with arrays bigger than RAM and local disk. Bigarrays are persisted to storage and can be changed in a transactional manner.

In other words, bigarrays are something like numpy.memmap for numpy.ndarray and OS files, but they support transactions and files bigger than disk. A whole bigarray cannot generally be used as a drop-in replacement for a NumPy array, but bigarray slices are real ndarrays and can be used everywhere an ndarray can be used, including in C/Cython/Fortran code. Slice size is limited by the virtual address-space size, which is at most ~127TB on Linux/amd64.
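For comparison, here is a minimal sketch of the numpy.memmap behaviour that the paragraph above refers to. It uses only NumPy and a temporary file, not wendelin.core; the file name and array size are arbitrary choices made for illustration.

```python
import os
import tempfile

import numpy as np

# Illustration only: plain numpy.memmap on a temporary OS file (not wendelin.core).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.bin")   # hypothetical file name

# Create a file-backed array of 1000 float64 values; pages are loaded lazily.
m = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000,))

# Slices are ndarray instances and can be passed to ordinary NumPy functions.
view = m[10:20]
view[:] = np.arange(10)
is_ndarray = isinstance(view, np.ndarray)
mean_value = float(np.mean(view))

# Changes are written back to the file, but without any transactional
# guarantee: there is no way to atomically commit or abort them.
m.flush()
del view, m

# Reopen read-only and check that the data was persisted.
r = np.memmap(path, dtype=np.float64, mode="r", shape=(1000,))
persisted = r[10:20].tolist()
del r
os.remove(path)
os.rmdir(tmpdir)
```

With numpy.memmap the backing file must fit on local disk and writes are not transactional, which are the two points where bigarrays are described as differing.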

The main class to work with is ZBigArray; it is used like ndarray from NumPy:

  1. create array:

    from wendelin.bigarray.array_zodb import ZBigArray
    import transaction
    
    # root is the root object of an opened database connection
    root['A'] = A = ZBigArray(shape=..., dtype=...)
    transaction.commit()
  2. view array as a real ndarray:

    a = A[:]        # view covering the whole array, if it fits into the address space
    b = A[10:100]

    Data for views is loaded lazily on memory access.

  3. work with views, including using C/Cython/Fortran functions from NumPy and other libraries to read/modify data:

    import numpy

    a[2] = 1
    a[10:20] = numpy.arange(10)
    numpy.mean(a)

    • The amount of modifications in one transaction should be less than available RAM.
    • The amount of data read is limited only by virtual address-space size.
  4. data can be appended to the array in O(δ) time:

    values                  # ndarray of shape (δ,) to append
    A.append(values)

    and the array itself can be resized in O(1) time:

    A.resize(newshape)
  5. changes to array data can be either discarded or saved back to the database:

    transaction.abort()     # discard all changes made in this transaction
    transaction.commit()    # atomically save all changes

When NEO or ZEO is used as the database, bigarrays can be used simultaneously by several nodes in a cluster.

Please see demo/demo_zbigarray.py for a complete example.

Current state and Roadmap

Wendelin.core works in real life for workloads Nexedi is using in production, including 24/7 projects. We are, however, aware of the following limitations and things that need to be improved:

  • wendelin.core is currently not very fast

  • third-party libraries (NumPy, scikit-learn, …) make big temporary array allocations, proportional in size to the input, which can in practice prevent processing out-of-core arrays, depending on the functionality used.

Thus:

  • we are currently working on an improved wendelin.core design and implementation, which uses the kernel virtual memory manager (complemented by one implemented in userspace), with the array backend presented to the kernel via FUSE as a virtual filesystem implemented in Go.

    As of November 2021 this filesystem has reached alpha state and is staged for real-world trials.

In parallel we will also:

  • try wendelin.core 1.0 on large data sets

  • identify and incrementally fix big-temporaries allocation issues in NumPy and scikit-learn

We are open to community help with the above.
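The temporary-allocation limitation listed above can be illustrated with plain NumPy. numpy.var computes the deviations from the mean as an array as large as the input, whereas a block-wise loop keeps temporaries bounded by the block size. The helper name chunked_variance and the block size below are assumptions made for illustration; they are not part of wendelin.core, NumPy or scikit-learn.

```python
import numpy as np

def chunked_variance(a, block=1 << 16):
    """Population variance of 1-D array `a`, using O(block) temporaries."""
    n = a.shape[0]
    if n == 0:
        raise ValueError("empty array")
    # First pass: mean, accumulated from per-block sums.
    total = 0.0
    for start in range(0, n, block):
        total += float(np.sum(a[start:start + block], dtype=np.float64))
    mean = total / n
    # Second pass: sum of squared deviations, one block at a time.
    sq = 0.0
    for start in range(0, n, block):
        d = a[start:start + block] - mean      # temporary of at most `block` items
        sq += float(np.dot(d, d))
    return sq / n

a = np.arange(10**6, dtype=np.float64)
v_chunked = chunked_variance(a, block=4096)
v_direct = float(np.var(a))                    # allocates an input-sized temporary
```

Because the function only reads slices of its argument, the same loop could in principle be applied to views of an out-of-core array, where each slice is loaded lazily.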

Additional materials

  • Wendelin.core tutorial

  • Slides (pdf) from presentation about wendelin.core in PyData Paris 2015


Wendelin.core change history

  • 2.0.alpha3 (2022-12-21) More fixes discovered by on-field usage.

  • 2.0.alpha2 (2022-01-27) Fix several crashes discovered by first on-field usage.

2.0.alpha1 (2021-11-16)

This is a major pre-release that reduces wendelin.core RAM consumption dramatically:

The project switches to mainly using the kernel virtual memory manager. Bigfiles are now primarily accessed with plain OS-level mmap of files from the synthetic WCFS filesystem. This allows a bigfile’s cache (which is now the kernel’s pagecache) to be shared between several processes.

In addition, a custom coherency protocol is provided. It allows clients that want an isolation guarantee (“I” from ACID) to still use the shared cache and, at the same time, get a bigfile view isolated from others’ changes.

By default the wendelin.core Python client continues to provide full ACID semantics, as before.

In addition to being significantly more efficient, WCFS also fixes data-corruption bugs that were discovered in how Wendelin.core 1 handles invalidations on BTree topology change.

Please see wcfs.go for a description of the new filesystem.

Major steps: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
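As a rough illustration of why plain OS-level mmap lets several users share one cache, the following sketch maps the same ordinary temporary file twice with the standard-library mmap module. It does not use WCFS; both mappings live in a single process only to keep the example short, and the file name is hypothetical.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "bigfile.bin")    # hypothetical stand-in for a WCFS file
with open(path, "wb") as f:
    f.truncate(4 * PAGE)                       # file of 4 zero-filled pages

f1 = open(path, "r+b")
f2 = open(path, "r+b")
# Two independent shared mappings of the same file (the default access mode
# of mmap.mmap is shared and writable). Both are backed by the same
# page-cache pages, so no data is duplicated between them.
m1 = mmap.mmap(f1.fileno(), 0)
m2 = mmap.mmap(f2.fileno(), 0)

m1[PAGE:PAGE + 5] = b"hello"                   # write through the first mapping
seen_by_second = bytes(m2[PAGE:PAGE + 5])      # visible through the second one

for m in (m1, m2):
    m.close()
for f in (f1, f2):
    f.close()
os.remove(path)
os.rmdir(tmpdir)
```

Plain shared mappings like these see each other's writes immediately. The isolation described in this entry is what the coherency protocol adds on top, and it is not shown here.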

0.13 (2019-06-18)

  • Add support for Python 3.7 (commit).

  • Add RAMArray class which is compatible with ZBigArray in API and semantics, but stores its data in RAM only (commit 1, 2).

  • Add lib.xnumpy.structured - utility to create structured view of an array (commit 1, 2).

  • Fix logic to keep ZBigFileH in sync with ZODB connection (commit).

  • Fix crash on PyVMA deallocation (commit).

  • Move py.bench to pygolang so that it can be used not only in Wendelin.core (commit).

  • Enhance t/qemu-runlinux - utility that is used to work on combined kernel/user-space workloads (commit 1, 2, 3, 4, 5, 6). This was in particular useful to develop Linux kernel fixes that are needed for Wendelin.core 2.0 (kernel commit 1, 2, 3, 4, 5, 6, 7).

  • Various bugfixes.

0.12 (2018-04-16)

  • Update licensing to be in line with whole Nexedi stack (commit). Please see https://www.nexedi.com/licensing for details, rationale and options.

  • Add ArrayRef utility that finds, for a NumPy array, its top-level root parent and how to recreate the array as a view of that root; this lays the foundation for, e.g., sending arrays as references without copying in the CMFActivity joblib backend (commit 1, 2, 3).

  • Don’t crash if during loadblk() garbage collection was run twice at tricky times (commit 1, 2).

  • Don’t crash on writeout if previously storeblk() resulted in error (commit).

  • Fix py.bench and rework it to produce output in Go benchmarking format (commit 1, 2, 3, 4, 5); add benchmarks for handling pagefaults (commit).

  • Use zodbtools/zodburi, if available, to open database by URL (commit).

  • Start to make sure it works with ZODB5 too (commit 1, 2).

  • Various bugfixes.

0.11 (2017-03-28)

  • Switch back to using ZBlk0 format by default (commit)

0.10 (2017-03-16)

  • Tell the world dtype=object is not supported (commit)

0.9 (2017-01-17)

  • Avoid deadlocks via doing storeblk() calls with virtmem lock released (commit 1, 2)

  • Don’t crash if in loadblk() implementation an exception is internally raised & caught (commit 1, 2, 3)

0.8 (2016-09-28)

  • Do not leak memory when loading data in ZBlk1 format (commit).

0.7 (2016-07-14)

  • Add support for Python 3.5 (commit 1, 2)

  • Fix bug in pagemap code which could lead to crashes and other issues (commit)

  • Various bugfixes

0.6 (2016-06-13)

  • Add support for FORTRAN ordering (commit 1, 2)

  • Avoid deadlocks via doing loadblk() calls with virtmem lock released (commit 1, 2)

  • Various bugfixes

0.5 (2015-10-02)

  • Introduce another storage format, which is optimized for small changes, and make it the default. (commit 1, 2)

  • Various bugfixes and documentation improvements

0.4 (2015-08-19)

  • Add support for O(δ) in-place BigArray.append() (commit)

  • Implement proper multithreading support (commit)

  • Implement proper RAM pages invalidation when backing ZODB objects are changed from outside (commit 1, 2)

  • Fix all kinds of failures that could happen when a ZODB connection changes worker thread in between handling requests (commit 1, 2)

  • Tox tests now cover usage with FileStorage, ZEO and NEO ZODB storages (commit 1, 2)

  • Various bugfixes

0.3 (2015-06-12)

  • Add support for automatic BigArray -> ndarray conversion, so that e.g. the following:

    A = BigArray(...)
    numpy.mean(A)       # passing BigArray to plain NumPy function

    either succeeds, or raises MemoryError if not enough address space is available to cover the whole of A (the current limit is ~127TB on Linux/amd64).

    (commit)

  • Various bugfixes (build-fixes, crashes, overflows, etc)
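The automatic conversion described in the 0.3 entry can be pictured with NumPy's standard __array__ protocol, which NumPy functions call to obtain an ndarray from a foreign object. The sketch below assumes that mechanism; the class LazyArray is hypothetical and keeps its data in memory, unlike a real BigArray.

```python
import numpy as np

class LazyArray(object):
    """Hypothetical array-like object; not the wendelin.core BigArray class."""

    def __init__(self, shape, dtype=np.float64):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        # In-memory stand-in for the backing store.
        self._data = np.zeros(shape, dtype=self.dtype)

    def __getitem__(self, idx):
        return self._data[idx]       # slices are real ndarrays

    def __array__(self, dtype=None, copy=None):
        # Called by NumPy when the object is passed where an ndarray is expected.
        # A real out-of-core array could raise MemoryError here if the whole
        # array does not fit into the address space.
        a = self[:]
        return a if dtype is None else a.astype(dtype)

A = LazyArray((100,))
A[:][:] = 2.0                        # modify the data through a full view
mean_A = float(np.mean(A))           # LazyArray passed to a plain NumPy function
```

The design point is that the object itself does not need to be an ndarray subclass: exposing a conversion hook is enough for functions such as numpy.mean to accept it.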

0.2 (2015-05-25)

  • Add support for O(1) in-place BigArray.resize() (commit)

  • Various build bugfixes (older systems, non-std python, etc)

0.1 (2015-04-03)

  • Initial release

Source Distribution

wendelin.core-2.0a3.tar.gz (3.6 MB)

Uploaded Source

File details

Details for the file wendelin.core-2.0a3.tar.gz.

File metadata

  • Download URL: wendelin.core-2.0a3.tar.gz
  • Upload date:
  • Size: 3.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.15.0 pkginfo/1.8.2 requests/2.26.0 setuptools/44.1.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/2.7.18

File hashes

Hashes for wendelin.core-2.0a3.tar.gz
Algorithm Hash digest
SHA256 3f551efbee71a685234878856a894545ffb2d28bfc60e274f23b4907091c6f1f
MD5 d9f83b5c7dd38519ef05e86451c6e607
BLAKE2b-256 8d1281433092033a2275f9fde207ec502b848e82dbe75b53b60cc6c2ab3ee496
