Low-impact, task-level memory profiling for Dask.
dask-memusage
If you're using Dask with tasks that use a lot of memory, RAM is your bottleneck for parallelism. That means you want to know how much memory each task uses:
- So you can set the highest parallelism level (processes or threads) for each machine, given the available RAM.
- In order to know where to focus memory optimization efforts.
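As a rough illustration of the first point, the sketch below turns a measured peak per-task memory figure into a worker count. The function name `max_parallel_tasks` and the `headroom` parameter are hypothetical and not part of dask-memusage; the 10% safety margin is an assumption you should tune for your own machines.

```python
def max_parallel_tasks(available_ram_mb: float,
                       peak_task_mb: float,
                       headroom: float = 0.1) -> int:
    """Estimate how many memory-heavy tasks can run at once on one machine.

    `headroom` is the fraction of RAM kept free for the OS and other
    processes (an assumed default, not a dask-memusage setting).
    """
    if peak_task_mb <= 0:
        raise ValueError("peak_task_mb must be positive")
    usable_mb = available_ram_mb * (1.0 - headroom)
    # Always allow at least one task, even if it does not fit comfortably.
    return max(1, int(usable_mb // peak_task_mb))


# Example: a 16 GiB machine running tasks that peak at about 97 MB each.
print(max_parallel_tasks(16 * 1024, 97.0))
```

The result is only an upper bound: it assumes every task hits its peak at the same moment and that the measured peak is representative of the whole workload.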
dask-memusage is an MIT-licensed statistical memory profiler for Dask's Distributed scheduler that can help you with both these problems.
dask-memusage polls your processes for memory usage and records the minimum and maximum usage in a CSV:
task_key,min_memory_mb,max_memory_mb
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)",44.84765625,96.98046875
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)",47.015625,97.015625
"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)",0,0
"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)",0,0
sum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0
apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625
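To decide where to focus optimization effort, you can rank tasks by peak memory. The following is a minimal sketch using only the standard library; the helper name `peak_by_task` is hypothetical, and the embedded sample simply repeats the rows shown above so the snippet runs on its own. In practice you would open the CSV file written by dask-memusage instead.

```python
import csv
import io

# Sample rows copied from the CSV output shown above.
SAMPLE = """\
task_key,min_memory_mb,max_memory_mb
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 0)",44.84765625,96.98046875
"('from_sequence-map-sum-part-e15703211a549e75b11c63e0054b53e5', 1)",47.015625,97.015625
"('sum-part-e15703211a549e75b11c63e0054b53e5', 0)",0,0
"('sum-part-e15703211a549e75b11c63e0054b53e5', 1)",0,0
sum-aggregate-apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,0,0
apply-no_allocate-4c30eb545d4c778f0320d973d9fc8ea6,47.265625,47.265625
"""


def peak_by_task(csv_file):
    """Return (task_key, max_memory_mb) pairs, highest peak first."""
    reader = csv.DictReader(csv_file)
    rows = [(row["task_key"], float(row["max_memory_mb"])) for row in reader]
    return sorted(rows, key=lambda pair: pair[1], reverse=True)


ranked = peak_by_task(io.StringIO(SAMPLE))
for key, peak_mb in ranked[:3]:
    print(f"{peak_mb:8.2f} MB  {key}")
```

To read a real report, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("/tmp/memusage.csv", newline="")`.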
Usage
Important: Make sure your workers only have a single thread! Otherwise the results will be wrong.
Installation
On the machine where you are running the Distributed scheduler, run:
$ pip install dask_memusage
Or if you're using Conda:
$ conda install -c conda-forge dask-memusage
API usage
# Add to your Scheduler object, which is e.g. your LocalCluster's scheduler
# attribute:
from dask_memusage import install
install(scheduler, "/tmp/memusage.csv")
CLI usage
$ dask-scheduler --preload dask_memusage --memusage-csv /tmp/memusage.csv
Limitations
- Again, make sure you only have one thread per worker process.
- This is statistical profiling, running every 10ms. Tasks that take less than that won't have accurate information.
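The second limitation can be illustrated with a small simulation. This is not dask-memusage's implementation, only a sketch of fixed-interval sampling under simplifying assumptions: time is in whole milliseconds, samples fall on multiples of the interval, and `sample_min_max` and `usage` are hypothetical names.

```python
def usage(t_ms: int) -> float:
    """Simulated process memory in MB: a brief 500 MB spike from 12 to 18 ms."""
    return 500.0 if 12 <= t_ms < 18 else 50.0


def sample_min_max(usage_fn, task_start_ms, task_end_ms, interval_ms=10):
    """Sample usage_fn at every tick inside [task_start_ms, task_end_ms).

    Returns (min, max) over the samples, or None if no tick fell inside
    the task's lifetime.
    """
    # Round the start time up to the next multiple of interval_ms.
    first_tick = -(-task_start_ms // interval_ms) * interval_ms
    samples = [usage_fn(t) for t in range(first_tick, task_end_ms, interval_ms)]
    if not samples:
        return None
    return min(samples), max(samples)


# A 40 ms task sampled every 10 ms misses the 6 ms spike entirely.
print(sample_min_max(usage, 0, 40))
# A task shorter than the interval may get no samples at all.
print(sample_min_max(usage, 12, 18))
# A finer (hypothetical) interval would have caught the spike.
print(sample_min_max(usage, 0, 40, interval_ms=2))
```

The takeaway is that reported maxima are lower bounds on true peak usage, and very short tasks may have no meaningful measurement at all.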
Help
Need help? File a ticket at https://github.com/itamarst/dask-memusage/issues/new