fastparquet

Python support for Parquet file format

Reason this release was yanked:

deps not updated

Project description

fastparquet is a Python implementation of the parquet format, aiming to integrate into Python-based big data workflows.

Not all parts of the parquet-format have been implemented or tested yet; see, for example, the Todos linked below. That said, fastparquet is capable of reading all the data files from the parquet-compatibility project.

Introduction

Details of this project can be found in the documentation.

The original plan listing expected features can be found in this issue. Please feel free to comment on that list as to missing items and priorities, or raise new issues with bugs or requests.

Requirements

(all development is against recent versions in the default anaconda channels)

Required:

numba
numpy
pandas
cython
six

Optional (compression algorithms; gzip is always available):

snappy (aka python-snappy)
lzo
brotli
lz4
zstandard
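
The optional compression libraries listed above are only useful if they can actually be imported in the current environment. The following is a minimal sketch, not part of fastparquet, that checks which of them can be found. The import names in the mapping are assumptions (a package's import name can differ from its install name, as with python-snappy), and a module being importable does not by itself guarantee that fastparquet will use it.

```python
import importlib.util

# Assumed import names for the optional compression packages listed above.
# These are illustrative; adjust them if your environment differs.
OPTIONAL_CODEC_MODULES = {
    "snappy": "snappy",        # installed as python-snappy
    "lzo": "lzo",
    "brotli": "brotli",
    "lz4": "lz4",
    "zstandard": "zstandard",
}


def available_codecs(modules=OPTIONAL_CODEC_MODULES):
    """Return {codec name: bool} telling whether each module can be located."""
    # gzip ships with the Python standard library, so it should always be found.
    found = {"gzip": importlib.util.find_spec("gzip") is not None}
    for name, module in modules.items():
        found[name] = importlib.util.find_spec(module) is not None
    return found


if __name__ == "__main__":
    for name, ok in available_codecs().items():
        print(f"{name}: {'available' if ok else 'not installed'}")
```

Running the script prints one line per codec, which gives a quick check before choosing a `compression` value when writing.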

Installation

Install using conda:

conda install -c conda-forge fastparquet

Install from PyPI:

pip install fastparquet

Or install the latest version from GitHub:

pip install git+https://github.com/dask/fastparquet

For the pip methods, numba must have been previously installed (using conda).

Usage

Reading

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
# Load every column into a pandas DataFrame
df = pf.to_pandas()
# Load only two columns, keeping col1 as a categorical
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load, and which of those to keep as categoricals (if the data uses dictionary encoding). The file path can be a single file, a metadata file pointing to other data files, or a directory (tree) containing data files. The last of these is what hive/spark typically output.
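
To make the three kinds of input path concrete, here is a small illustrative helper. It is not fastparquet's own logic: the function name is hypothetical, and it assumes that a metadata file is named `_metadata`, as is conventional in hive/spark-style output.

```python
import os


def describe_parquet_path(path):
    """Classify a path as one of the three input kinds described above.

    Hypothetical helper for illustration only; assumes the metadata file
    of a multi-file dataset is called "_metadata".
    """
    if os.path.isdir(path):
        return "directory"
    if os.path.basename(path) == "_metadata":
        return "metadata file"
    return "single file"


if __name__ == "__main__":
    for candidate in ("myfile.parq", os.path.join("dataset", "_metadata"), os.getcwd()):
        print(candidate, "->", describe_parquet_path(candidate))
```

In each case the same path string would be passed to `ParquetFile`; the helper only shows how the cases differ on disk.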

Writing

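from fastparquet import write
# df is an existing pandas DataFrame
# Defaults: one output file, one row-group, no compression
write('outfile.parq', df)
# Hive-style directory output, GZIP-compressed, with explicit row-group starts
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')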

The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data-types and plain encoding are supported, so expect performance to be similar to numpy.savez.
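
The `row_group_offsets` list in the second call gives the row index at which each row-group starts. As a rough sketch of how such a list partitions a frame, the hypothetical helper below (not a fastparquet function) pairs each start offset with the next one and lets the last group run to the end of the data. The total of 25000 rows is an assumed value chosen for illustration.

```python
def row_group_bounds(offsets, n_rows):
    """Return (start, end) row ranges, end exclusive, for each row-group.

    Hypothetical helper: each group starts at its offset and ends where
    the next group starts; the final group ends at n_rows.
    """
    starts = list(offsets)
    ends = starts[1:] + [n_rows]
    return list(zip(starts, ends))


if __name__ == "__main__":
    # Assumed frame length of 25000 rows, for illustration only.
    print(row_group_bounds([0, 10000, 20000], 25000))
    # [(0, 10000), (10000, 20000), (20000, 25000)]
```

With a single offset of `[0]`, the same helper yields one range covering the whole frame, which matches the single row-group default described above.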

History

Since early October 2016, this fork of parquet-python has been undergoing considerable redevelopment. The aim is to have a small, simple, and performant library for reading and writing the parquet format from Python.

Release history

2024.11.0

Nov 12, 2024

2024.5.0

May 21, 2024

2024.2.0

Feb 8, 2024

2023.10.1

Oct 26, 2023

2023.10.0

Oct 25, 2023

2023.8.0

Aug 30, 2023

2023.7.0

Jul 1, 2023

2023.4.0

Apr 27, 2023

2023.2.0

Feb 8, 2023

2023.1.0

Jan 19, 2023

2022.12.0

Dec 5, 2022

2022.11.0

Nov 17, 2022

0.8.3

Aug 28, 2022

0.8.2

Aug 19, 2022

0.8.1

Apr 1, 2022

0.8.0

Jan 26, 2022

0.7.2

Nov 22, 2021

0.7.1

Aug 3, 2021

0.7.0

Jul 16, 2021

0.6.3

May 13, 2021

0.6.2

May 12, 2021

0.6.1

May 11, 2021

0.6.0.post1

May 6, 2021

0.6.0

May 6, 2021

0.5.0

Dec 29, 2020

This version

0.4.2 yanked

Dec 14, 2020

0.4.1

Jul 16, 2020

0.4.0

May 12, 2020

0.3.3

Feb 5, 2020

0.3.2

Aug 1, 2019

0.3.1

Apr 25, 2019

0.3.0

Mar 30, 2019

0.2.1

Dec 18, 2018

0.2.0

Nov 22, 2018

0.1.6

Aug 19, 2018

0.1.5

Apr 1, 2018

0.1.4

Jan 27, 2018

0.1.3

Oct 8, 2017

0.1.2

Aug 28, 2017

0.1.1

Jul 21, 2017

0.1.0

Jun 13, 2017

0.0.6

May 4, 2017

0.0.5

Feb 16, 2017

0.0.4.post1

Dec 27, 2016

0.0.4

Dec 27, 2016

0.0.3

Dec 1, 2016

0.0.2

Nov 15, 2016

0.0.1.post2

Nov 1, 2016

0.0.1.post1

Nov 1, 2016

0.0.1

Nov 1, 2016

Download files

Source Distribution

fastparquet-0.4.2.tar.gz (121.5 kB)

Uploaded Dec 14, 2020 Source

Built Distribution

fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl (149.0 kB)

Uploaded Dec 14, 2020 CPython 3.7m macOS 10.9+ x86-64

File details

Details for the file fastparquet-0.4.2.tar.gz.

File metadata

Download URL: fastparquet-0.4.2.tar.gz
Upload date: Dec 14, 2020
Size: 121.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for fastparquet-0.4.2.tar.gz
Algorithm	Hash digest
SHA256	`a0829019d47c248bc64395e81ec3beb4cb2ee5099bce8724acfac9b237c36620`
MD5	`c6f3e6a1149b5a6944006012ad33e63e`
BLAKE2b-256	`b22084e9be7938539a7610813a4bfd1a2745e4b29715cb7cf1ce392232917cfe`

File details

Details for the file fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

Download URL: fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl
Upload date: Dec 14, 2020
Size: 149.0 kB
Tags: CPython 3.7m, macOS 10.9+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/49.6.0.post20200814 requests-toolbelt/0.9.1 tqdm/4.48.2 CPython/3.7.8

File hashes

Hashes for fastparquet-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm	Hash digest
SHA256	`39dbbef2fea7d330254b00a67e053a7db94287573bb6e298cb87313d795bc6a6`
MD5	`a403b956484e26149ac28c3d6deb2c9c`
BLAKE2b-256	`078d1999ed7ad337444d7cd41954b7b4154acb38452da6be402f869a3e815b7b`