Python support for Parquet file format

Reason this release was yanked:

deps not updated

Project description

fastparquet is a Python implementation of the parquet format, aiming to integrate into Python-based big data workflows.

Not all parts of the parquet-format have been implemented or tested yet; see, for example, the Todos linked below. That said, fastparquet is capable of reading all the data files from the parquet-compatibility project.

Introduction

Details of this project can be found in the documentation.

The original plan listing expected features can be found in this issue. Please feel free to comment on that list as to missing items and priorities, or raise new issues with bugs or requests.

Requirements

(All development is against recent versions of these packages in the default Anaconda channels.)

Required:

  • numba

  • numpy

  • pandas

  • cython

  • six

Optional (compression algorithms; gzip is always available):

  • snappy (aka python-snappy)

  • lzo

  • brotli

  • lz4

  • zstandard
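
To see which of these optional codecs are usable in a given environment, a quick check along the following lines may help. This is a minimal sketch using only the standard library; the module names in OPTIONAL_CODECS are assumptions about how each package is imported, and available_codecs is a hypothetical helper, not part of fastparquet.

```python
import importlib.util

# Assumed import names for the optional codecs listed above; the name a
# given fastparquet release actually imports may differ.
OPTIONAL_CODECS = {
    "snappy": "snappy",        # provided by the python-snappy package
    "lzo": "lzo",
    "brotli": "brotli",
    "lz4": "lz4",
    "zstandard": "zstandard",
}


def available_codecs():
    """Map each codec name to True if its module can be located."""
    # gzip ships with the Python standard library, so it should always be found.
    found = {"gzip": importlib.util.find_spec("gzip") is not None}
    for codec, module in OPTIONAL_CODECS.items():
        # find_spec returns None for a missing top-level module
        # and does not import the module itself.
        found[codec] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for codec, ok in available_codecs().items():
        print(f"{codec:10s} {'available' if ok else 'missing'}")
```

A codec reported as missing only means its module could not be located; whether fastparquet can use a module that is present still depends on the installed versions.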

Installation

Install using conda:

conda install -c conda-forge fastparquet

Install from PyPI:

pip install fastparquet

Or install the latest version from GitHub:

pip install git+https://github.com/dask/fastparquet

For the pip methods, numba must have been previously installed (using conda).

Usage

Reading

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load, and which of those to keep as categoricals (if the data uses dictionary encoding). The file-path can be a single file, a metadata file pointing to other data files, or a directory (tree) containing data files. The latter is what is typically output by hive/spark.
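
For the directory-tree case, it can help to picture how partition values are encoded in the paths. The sketch below assumes the common Hive/Spark convention of key=value directory names; partition_values is a hypothetical helper written for illustration, not a fastparquet function, and the example path is made up.

```python
from pathlib import PurePosixPath


def partition_values(relative_path):
    """Return {key: value} parsed from key=value directory names."""
    # Every path component except the last one (the file name) is a directory.
    directories = PurePosixPath(relative_path).parts[:-1]
    values = {}
    for name in directories:
        key, sep, value = name.partition("=")
        if sep and key:  # skip directories that are not of the form key=value
            values[key] = value
    return values


# Hypothetical layout, as Hive or Spark might write it.
print(partition_values("year=2016/month=10/part.0.parquet"))
# {'year': '2016', 'month': '10'}
```

Values parsed this way are plain strings; any conversion to numeric or categorical types is left to the reader of the data set.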

Writing

from fastparquet import write
write('outfile.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')

The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data-types and plain encoding are supported, so expect performance to be similar to numpy.savez.
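
When row_group_offsets is given as a list, each entry is understood here as the row index at which a new row-group starts, with the last group running to the end of the data. The following sketch illustrates that reading in plain Python; row_group_ranges is a hypothetical helper, the 25000-row frame is an assumed example, and the validation rules are this sketch's own rather than fastparquet's.

```python
def row_group_ranges(offsets, n_rows):
    """Turn starting offsets into half-open (start, stop) row ranges."""
    if not offsets or offsets[0] != 0:
        raise ValueError("offsets must be non-empty and start at 0")
    if any(b <= a for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be strictly increasing")
    if offsets[-1] >= n_rows:
        raise ValueError("last offset must be smaller than n_rows")
    # Each group stops where the next one starts; the last stops at n_rows.
    stops = list(offsets[1:]) + [n_rows]
    return list(zip(offsets, stops))


print(row_group_ranges([0, 10000, 20000], 25000))
# [(0, 10000), (10000, 20000), (20000, 25000)]
```

Under this reading, the second write call above would split a 25000-row frame into three row-groups, the last one holding 5000 rows.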

History

Since early October 2016, this fork of parquet-python has been undergoing considerable redevelopment. The aim is to have a small, simple, and performant library for reading and writing the parquet format from Python.

Download files

Source Distribution

fastparquet-0.4.2.tar.gz (121.5 kB)

Uploaded Source

Built Distribution

fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl (149.0 kB)

Uploaded CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file fastparquet-0.4.2.tar.gz.

File metadata

  • Download URL: fastparquet-0.4.2.tar.gz
  • Size: 121.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for fastparquet-0.4.2.tar.gz
Algorithm    Hash digest
SHA256       a0829019d47c248bc64395e81ec3beb4cb2ee5099bce8724acfac9b237c36620
MD5          c6f3e6a1149b5a6944006012ad33e63e
BLAKE2b-256  b22084e9be7938539a7610813a4bfd1a2745e4b29715cb7cf1ce392232917cfe

File details

Details for the file fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl
  • Size: 149.0 kB
  • Tags: CPython 3.7m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm    Hash digest
SHA256       39dbbef2fea7d330254b00a67e053a7db94287573bb6e298cb87313d795bc6a6
MD5          a403b956484e26149ac28c3d6deb2c9c
BLAKE2b-256  078d1999ed7ad337444d7cd41954b7b4154acb38452da6be402f869a3e815b7b
