
Utilities for Dask and cuDF interactions

Project description

 Dask cuDF - A GPU Backend for Dask DataFrame

Dask cuDF (a.k.a. dask-cudf or dask_cudf) is an extension library for Dask DataFrame that provides a Pandas-like API for parallel and larger-than-memory DataFrame computing on GPUs. When installed, Dask cuDF is automatically registered as the "cudf" dataframe backend for Dask DataFrame.
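
For example, once Dask cuDF is installed, the "cudf" backend can be selected through Dask's configuration system so that new collections are backed by cuDF. The snippet below is a minimal single-GPU sketch; the toy data is made up for illustration:

import dask
import dask.dataframe as dd

# Ask Dask DataFrame to create cuDF-backed collections
dask.config.set({"dataframe.backend": "cudf"})

# Creation functions such as from_dict and read_parquet now produce
# collections whose partitions are cudf.DataFrame objects
df = dd.from_dict({"item": ["a", "b", "a"], "price": [1.0, 2.0, 3.0]}, npartitions=1)
print(df.groupby("item")["price"].mean().compute())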

[!IMPORTANT] Dask cuDF does not provide support for multi-GPU or multi-node execution on its own. You must also deploy a distributed cluster (ideally with Dask-CUDA) to leverage multiple GPUs efficiently.

Using Dask cuDF

Please visit the official documentation page for detailed information about using Dask cuDF.

Installation

See the RAPIDS install page for the most up-to-date information and commands for installing Dask cuDF and other RAPIDS packages.

Quick-start example

A very common Dask cuDF use case is single-node multi-GPU data processing. These workflows typically use the following pattern:

import dask
import dask.dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client

if __name__ == "__main__":

    # Define a GPU-aware cluster to leverage multiple GPUs
    client = Client(
        LocalCUDACluster(
            CUDA_VISIBLE_DEVICES="0,1",  # Use two workers (on devices 0 and 1)
            rmm_pool_size=0.9,  # Use 90% of GPU memory as a pool for faster allocations
            enable_cudf_spill=True,  # Improve device memory stability
            local_directory="/fast/scratch/",  # Use fast local storage for spilling
        )
    )

    # Set the default dataframe backend to "cudf"
    dask.config.set({"dataframe.backend": "cudf"})

    # Create your DataFrame collection from on-disk
    # or in-memory data
    df = dd.read_parquet("/my/parquet/dataset/")

    # Use cudf-like syntax to transform and/or query your data
    query = df.groupby("item")["price"].mean()

    # Compute, persist, or write out the result
    query.head()
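
For example, the final result can be materialized on the client, kept in GPU memory on the workers, or written back to disk. The lines below are an illustrative continuation of the example above (the output path is hypothetical):

# Bring the (small) aggregated result back as a cudf.Series
result = query.compute()

# Or keep the result on the workers for later reuse
query = query.persist()

# Or write it out as Parquet (converting the Series to a DataFrame first)
query.to_frame().to_parquet("/my/output/dataset/")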

If you do not have multiple GPUs available, using LocalCUDACluster is optional. However, it is still a good idea to enable cuDF spilling.
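
For a single-GPU workflow, spilling can be switched on directly in cuDF before any data is created. A minimal sketch (setting the CUDF_SPILL=on environment variable is an equivalent alternative):

import cudf
import dask
import dask.dataframe as dd

# Enable cuDF's device-to-host spilling for more stable GPU memory usage
cudf.set_option("spill", True)

dask.config.set({"dataframe.backend": "cudf"})
df = dd.read_parquet("/my/parquet/dataset/")
print(df.groupby("item")["price"].mean().head())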

If you wish to scale across multiple nodes, you will need to use a different mechanism to deploy your Dask-CUDA workers. Please see the RAPIDS deployment documentation for more instructions.
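
Regardless of how the workers are deployed, the client-side code usually only needs the scheduler's address. A hedged sketch (the address below is a placeholder for your own scheduler):

import dask
import dask.dataframe as dd
from distributed import Client

# Connect to an already-running scheduler whose workers were started with Dask-CUDA
client = Client("tcp://scheduler-hostname:8786")  # placeholder address

dask.config.set({"dataframe.backend": "cudf"})
df = dd.read_parquet("/my/parquet/dataset/")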

Project details


Download files


Source Distribution

dask_cudf_cu11-24.10.1.tar.gz (2.3 kB)


File details

Details for the file dask_cudf_cu11-24.10.1.tar.gz.

File metadata

  • Download URL: dask_cudf_cu11-24.10.1.tar.gz
  • Upload date:
  • Size: 2.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.12

File hashes

Hashes for dask_cudf_cu11-24.10.1.tar.gz

  • SHA256: 57079c1ac3801ead283f23936e466070acfde27283386f21eddfd619dc2a3a02
  • MD5: ba586235e60dfc8a4f16198c54416806
  • BLAKE2b-256: 5d87eccf1fd2fbe758506eec8b1cda7f12d8d7d0104f4ca7d40c30189dffc133
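
As an illustration of how these digests can be used, the SHA256 checksum of a downloaded file can be verified locally. A generic sketch (the local filename is assumed to be in the current directory):

import hashlib

# Assumed local path to the downloaded sdist
path = "dask_cudf_cu11-24.10.1.tar.gz"

with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Compare against the published SHA256 digest above
print(digest == "57079c1ac3801ead283f23936e466070acfde27283386f21eddfd619dc2a3a02")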

