Command line interface to and serialization format for Blosc
- Contact: valentin@haenel.co
- Github: https://github.com/Blosc/bloscpack
Description
Command line interface to and serialization format for Blosc, a high performance, multi-threaded, blocking and shuffling compressor. Uses python-blosc bindings to interface with Blosc. Also comes with native support for efficiently serializing and deserializing Numpy arrays.
Dependencies
Python 2.6 (requires packages listed in 2.6_requirements.txt) or Python 2.7
python-blosc (provides Blosc) and Numpy (as listed in requirements.txt) for running the code
The Python packages listed in test_requirements.txt for testing
Stability of File Format
The tool is considered alpha-stage, experimental, research software. The internal storage format for the compressed files may well change in the future. Please do not depend critically on the files generated by Bloscpack unless you know what you are doing. See the warranty disclaimer in the licence at the end of this file.
Installation
Disclaimer: There are a myriad of ways of installing Python packages (and their dependencies) these days and it is a futile endeavour to explain the procedures in great detail again and again. Below are three methods that are known to work. Depending on the method you choose and the system you are using, you may require any or all of: super user privileges, a C++ compiler and/or a virtual environment. If you do run into problems or are unsure, it's best to send an email to the Blosc mailing list asking for help.
The package is available on PyPi, so you may use pip to install the dependencies and bloscpack itself:
$ pip install bloscpack
If you want to install straight from GitHub, use pip’s VCS support:
$ pip install git+https://github.com/Blosc/bloscpack
Or, of course, download the source code or clone the repository and then use the standard setup.py:
$ git clone https://github.com/Blosc/bloscpack
$ cd bloscpack
$ python setup.py install
Usage
Bloscpack is accessible from the command line via the blpk executable. It has a number of global options and four subcommands: [c | compress], [d | decompress], [a | append] and [i | info], each of which has its own options.
Help for global options and subcommands:
$ blpk --help
[...]
Help for each one of the subcommands:
$ blpk compress --help
[...]
$ blpk decompress --help
[...]
$ blpk info --help
[...]
$ blpk append --help
[...]
Examples
Basics
Basic compression:
$ blpk compress data.dat
Or:
$ blpk c data.dat
… will compress the file data.dat to data.dat.blp.
Basic decompression:
$ blpk decompress data.dat.blp data.dcmp
Or:
$ blpk d data.dat.blp data.dcmp
… will decompress the file data.dat.blp to the file data.dcmp. If you leave out the [<out_file>] argument, Bloscpack will complain that the file data.dat exists already and refuse to overwrite it:
$ blpk decompress data.dat.blp
blpk: error: output file 'data.dat' exists!
If you know what you are doing, you can use the global option [-f | --force] to override the overwrite checks:
$ blpk --force decompress data.dat.blp
Incidentally this works for compression too:
$ blpk compress data.dat
blpk: error: output file 'data.dat.blp' exists!
$ blpk --force compress data.dat
Lastly, if you want a different filename:
$ blpk compress data.dat custom.filename.blp
… will compress the file data.dat to custom.filename.blp.
Settings
By default, the number of threads that Blosc uses during compression and decompression is determined by the number of cores detected on your system. You can change this using the [-n | --nthreads] option:
$ blpk --nthreads 1 compress data.dat
Compression with Blosc is controlled with the following options:
[-t | --typesize] Typesize used by Blosc (default: 8):
$ blpk compress --typesize 8 data.dat
[-l | --level] Compression level (default: 7):
$ blpk compress --level 3 data.dat
[-s | --no-shuffle] Deactivate shuffle:
$ blpk compress --no-shuffle data.dat
[-c | --codec] Use alternative codec:
$ blpk compress --codec lz4 data.dat
In addition, there are the following options that control the Bloscpack file:
[-z | --chunk-size] Desired approximate size of the chunks, where you can use human readable strings like 8M or 128K, or max to use the maximum chunk size of approx. 2GB (default: 1MB):
$ blpk compress --chunk-size 128K data.dat
$ blpk c -z max data.dat
[-k | --checksum <checksum>] Choose which checksum to use. The following values are permissible: None, adler32, crc32, md5, sha1, sha224, sha256, sha384, sha512 (default: adler32). As described in the header format, each compressed chunk can be stored with a checksum, which aids corruption detection on decompression:
$ blpk compress --checksum crc32 data.dat
[-o | --no-offsets] By default, offsets to the individual chunks are stored; these are included to allow for partial decompression in the future, and this option disables that feature. Also, a certain number of offsets (default: 10 * 'nchunks') are preallocated to allow for appending data to the file:
$ blpk compress --no-offsets data.dat
Info Subcommand
If you just need some info on how the file was compressed, use [i | info]:
$ blpk info data.dat.blp
blpk: BloscpackHeader:
blpk: format_version: 3
blpk: offsets: True
blpk: metadata: False
blpk: checksum: 'adler32'
blpk: typesize: 8
blpk: chunk_size: 512.0M (536870912B)
blpk: last_chunk: 501.88M (526258176B)
blpk: nchunks: 3
blpk: max_app_chunks: 30
blpk: 'offsets':
blpk: [296,78074317,140782616,...]
Adding Metadata
Using the [-m | --metadata] option you can include JSON from a file:
$ cat meta.json
{"dtype": "float64", "shape": [200000000], "container": "numpy"}
$ blpk compress --chunk-size=512M --metadata meta.json data.dat
$ blpk info data.dat.blp
blpk: BloscpackHeader:
blpk: format_version: 3
blpk: offsets: True
blpk: metadata: True
blpk: checksum: 'adler32'
blpk: typesize: 8
blpk: chunk_size: 512.0M (536870912B)
blpk: last_chunk: 501.88M (526258176B)
blpk: nchunks: 3
blpk: max_app_chunks: 30
blpk: 'offsets':
blpk: [922,78074943,140783242,...]
blpk: 'metadata':
blpk: { u'container': u'numpy', u'dtype': u'float64', u'shape': [200000000]}
blpk: MetadataHeader:
blpk: magic_format: 'JSON'
blpk: meta_options: '00000000'
blpk: meta_checksum: 'adler32'
blpk: meta_codec: 'zlib'
blpk: meta_level: 6
blpk: meta_size: 59.0B (59B)
blpk: max_meta_size: 590.0B (590B)
blpk: meta_comp_size: 58.0B (58B)
blpk: user_codec: ''
It will be printed when decompressing:
$ blpk decompress data.dat.blp
blpk: Metadata is:
blpk: '{u'dtype': u'float64', u'shape': [200000000], u'container': u'numpy'}'
Appending
You can also append data to an existing bloscpack compressed file:
$ blpk append data.dat.blp data.dat
However there are certain limitations on the amount of data that can be appended. For example, if there is an offsets section, there must be enough room to store the offsets for the appended chunks. If no offsets exist, you may append as much data as possible given the limitations governed by the maximum number of chunks and the chunk-size. Additionally, there are limitations on the compression options. For example, one cannot change the checksum used. It is however possible to change the compression level, the typesize and the shuffle option for the appended chunks.
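For a rough sense of scale (an estimate, not an exact rule): for the file shown in the info example above (chunk_size of 512MB and max_app_chunks of 30), at most about 30 * 512MB = 15GB of additional data could be appended before the preallocated offset entries are exhausted.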
Also note that appending is still considered experimental as of v0.5.0.
Verbose and Debug mode
Lastly there are two mutually exclusive options to control how much output is produced.
The first causes basic info to be printed, [-v | --verbose]:
$ blpk --verbose compress --chunk-size 0.5G data.dat
blpk: using 4 threads
blpk: getting ready for compression
blpk: input file is: 'data.dat'
blpk: output file is: 'data.dat.blp'
blpk: input file size: 1.49G (1600000000B)
blpk: nchunks: 3
blpk: chunk_size: 512.0M (536870912B)
blpk: last_chunk_size: 501.88M (526258176B)
blpk: output file size: 198.39M (208028617B)
blpk: compression ratio: 7.691250
blpk: done
… and [-d | --debug] prints a detailed account of what is going on:
$ blpk --debug compress --chunk-size 0.5G data.dat
blpk: command line argument parsing complete
blpk: command line arguments are:
blpk: force: False
blpk: verbose: False
blpk: offsets: True
blpk: checksum: adler32
blpk: subcommand: compress
blpk: out_file: None
blpk: metadata: None
blpk: cname: blosclz
blpk: in_file: data.dat
blpk: chunk_size: 536870912
blpk: debug: True
blpk: shuffle: True
blpk: typesize: 8
blpk: clevel: 7
blpk: nthreads: 4
blpk: using 4 threads
blpk: getting ready for compression
blpk: input file is: 'data.dat'
blpk: output file is: 'data.dat.blp'
blpk: input file size: 1.49G (1600000000B)
blpk: nchunks: 3
blpk: chunk_size: 512.0M (536870912B)
blpk: last_chunk_size: 501.88M (526258176B)
blpk: BloscArgs:
blpk: typesize: 8
blpk: clevel: 7
blpk: shuffle: True
blpk: cname: 'blosclz'
blpk: BloscpackArgs:
blpk: offsets: True
blpk: checksum: 'adler32'
blpk: max_app_chunks: <function <lambda> at 0x1182de8>
blpk: metadata_args will be silently ignored
blpk: max_app_chunks is a callable
blpk: max_app_chunks was set to: 30
blpk: BloscpackHeader:
blpk: format_version: 3
blpk: offsets: True
blpk: metadata: False
blpk: checksum: 'adler32'
blpk: typesize: 8
blpk: chunk_size: 512.0M (536870912B)
blpk: last_chunk: 501.88M (526258176B)
blpk: nchunks: 3
blpk: max_app_chunks: 30
blpk: raw_bloscpack_header: 'blpk\x03\x01\x01\x08\x00\x00\x00 \x00\x10^\x1f\x03\x00\x00\x00\x00\x00\x00\x00\x1e\x00\x00\x00\x00\x00\x00\x00'
blpk: Handle chunk '0'
blpk: checksum (adler32): '\x1f\xed\x1e\xf4'
blpk: chunk handled, in: 512.0M (536870912B) out: 74.46M (78074017B)
blpk: Handle chunk '1'
blpk: checksum (adler32): ')\x1e\x08\x88'
blpk: chunk handled, in: 512.0M (536870912B) out: 59.8M (62708295B)
blpk: Handle chunk '2' (last)
blpk: checksum (adler32): '\xe8\x18\xa4\xac'
blpk: chunk handled, in: 501.88M (526258176B) out: 64.13M (67245997B)
blpk: Writing '3' offsets: '[296, 78074317, 140782616]'
blpk: Raw offsets: '(\x01\x00\x00\x00\x00\x00\x00\xcdQ\xa7\x04\x00\x00\x00\x00\x18,d\x08\x00\x00\x00\x00'
blpk: output file size: 198.39M (208028617B)
blpk: compression ratio: 7.691250
blpk: done
Python API
The Python API is still in flux, so this section is deliberately sparse.
Numpy
Numpy arrays can be serialized as Bloscpack files, here is a very brief example:
>>> import numpy as np
>>> import bloscpack as bp
>>> a = np.linspace(0, 1, 3e8)
>>> print a.size, a.dtype
300000000 float64
>>> bp.pack_ndarray_file(a, 'a.blp')
>>> b = bp.unpack_ndarray_file('a.blp')
>>> (a == b).all()
True
Looking at the generated file, we can see the Numpy metadata being saved:
$ lh a.blp
-rw------- 1 esc esc 266M Aug 13 23:21 a.blp
$ blpk info a.blp
blpk: BloscpackHeader:
blpk: format_version: 3
blpk: offsets: True
blpk: metadata: True
blpk: checksum: 'adler32'
blpk: typesize: 8
blpk: chunk_size: 1.0M (1048576B)
blpk: last_chunk: 838.0K (858112B)
blpk: nchunks: 2289
blpk: max_app_chunks: 22890
blpk: 'offsets':
blpk: [202170,408064,554912,690452,819679,...]
blpk: 'metadata':
blpk: { u'container': u'numpy',
blpk: u'dtype': u'<f8',
blpk: u'order': u'C',
blpk: u'shape': [300000000]}
blpk: MetadataHeader:
blpk: magic_format: 'JSON'
blpk: meta_options: '00000000'
blpk: meta_checksum: 'adler32'
blpk: meta_codec: 'zlib'
blpk: meta_level: 6
blpk: meta_size: 67.0B (67B)
blpk: max_meta_size: 670.0B (670B)
blpk: meta_comp_size: 62.0B (62B)
blpk: user_codec: ''
Alternatively, we can also use a string as storage:
>>> a = np.linspace(0, 1, 3e8)
>>> c = pack_ndarray_str(a)
>>> b = unpack_ndarray_str(c)
>>> (a == b).all()
True
Or use alternate compressors:
>>> a = np.linspace(0, 1, 3e8)
>>> c = pack_ndarray_str(a, blosc_args=BloscArgs(cname='lz4'))
>>> b = unpack_ndarray_str(c)
>>> (a == b).all()
True
If you are interested in the performance of Bloscpack compared to other serialization formats for Numpy arrays, please look at the benchmarks presented in the Bloscpack paper from the EuroScipy 2013 conference proceedings.
Testing
Installing Dependencies
Testing requires some additional libraries, which you can install from PyPi with:
$ pip install -r test_requirements.txt
[...]
Basic Tests
Basic tests, runs quickly:
$ nosetests
[...]
Heavier Tests
Extended tests using a larger file; these may take some time, but are nice to memory:
$ nosetests test/test_file_io.py:pack_unpack_hard
[...]
Extended tests using a huge file. This one takes forever, needs loads (5G-6G) of memory and loads of disk-space (10G). Use -s to print progress:
$ nosetests -s test/test_file_io.py:pack_unpack_extreme
[...]
Note that some compression/decompression tests create temporary files (on UNIXoid systems these live under /tmp/blpk*) which are deleted upon completion of the respective test, whether successful or unsuccessful, or when the test is aborted with e.g. ctrl-c (using atexit magic).
Under rare circumstances, for example when aborting the cleanup that is itself triggered by an abort, you may be left with large files polluting your temporary space. Depending on your partitioning scheme, doing this repeatedly may lead to you running out of space on that file-system.
Command Line Interface Tests
The command line interface is tested with cram:
$ cram --verbose test_cmdline/*.cram
[...]
Coverage
To determine coverage you can pool together the coverage from the cram tests and the unit tests:
$ COVERAGE=1 cram --verbose test_cmdline/*.cram
[...]
$ nosetests --with-coverage --cover-package=bloscpack
[...]
Test Runner
To run the command line interface tests and the unit tests and analyse coverage, use the convenience test.sh runner:
$ ./test.sh
[...]
Benchmark
Using the provided bench/blpk_vs_gzip.py script on an Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz with 2 cores and 4 threads (active hyperthreading), cpu frequency scaling activated but set to the performance governor (all cores scaled to 2.0 GHz), 8GB of DDR3 memory and a LUKS encrypted SSD, we get:
$ PYTHONPATH=. ./bench/blpk_vs_gzip.py
create the test data..........done
Input file size: 1.49G
Will now run bloscpack...
Time: 2.06 seconds
Output file size: 198.55M
Ratio: 7.69
Will now run gzip...
Time: 134.20 seconds
Output file size: 924.05M
Ratio: 1.65
As was expected from previous benchmarks of Blosc using the python-blosc bindings, Blosc is both much faster and achieves a better compression ratio for this kind of structured data. One thing to note here is that we are not dropping the system file cache after every step, so the file to read will be cached in memory. To get a more accurate picture we can use the --drop-caches switch of the benchmark, which however requires you to run the benchmark as root, since dropping the caches requires root privileges:
$ PYTHONPATH=. ./bench/blpk_vs_gzip.py --drop-caches
will drop caches
create the test data..........done
Input file size: 1.49G
Will now run bloscpack...
Time: 13.49 seconds
Output file size: 198.55M
Ratio: 7.69
Will now run gzip...
Time: 137.49 seconds
Output file size: 924.05M
Ratio: 1.65
Optimizing Chunk Size
You can use the provided bench/compression_time_vs_chunk_size.py file to optimize the chunk-size for a given machine. For example:
$ sudo env PATH=$PATH PYTHONPATH=. bench/compression_time_vs_chunk_size.py
create the test data..........done
chunk_size comp-time decomp-time ratio
512.0K 8.106235 10.243908 7.679094
724.08K 4.424007 12.284307 7.092846
1.0M 6.243544 11.978932 7.685173
1.41M 4.715511 10.780901 7.596981
2.0M 4.548568 10.676304 7.688216
2.83M 4.851359 11.668394 7.572480
4.0M 4.557665 10.127647 7.689736
5.66M 4.589349 9.579627 7.667467
8.0M 5.290080 10.525652 7.690499
Running the script requires super user privileges, since you need to synchronize disk writes and drop the file system caches for less noisy results. Also, you should probably run this script a couple of times and inspect the variability of the results.
Bloscpack Format
The input is split into chunks since a) we wish to put less stress on main memory and b) Blosc has a buffer limit of 2GB (Version 1.0.0 and above). By default the chunk-size is a moderate 1MB, which should be fine even for less powerful machines.
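As an illustration of the chunking arithmetic only (this is a sketch, not Bloscpack's internal implementation), the number of chunks and the size of the last chunk follow directly from the input size and the chunk-size:

def chunking(in_file_size, chunk_size=2**20):
    # Number of chunks needed to cover the input and the size of the
    # final, possibly smaller, chunk.
    nchunks, last_chunk = divmod(in_file_size, chunk_size)
    if last_chunk == 0:
        last_chunk = chunk_size
    else:
        nchunks += 1
    return nchunks, last_chunk

# The 1.49G (1600000000B) file from the examples, compressed with
# --chunk-size 0.5G (536870912B), yields 3 chunks and a last chunk of
# 526258176B -- matching the blpk --verbose output above.
print(chunking(1600000000, 536870912))  # (3, 526258176)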
In addition to the chunks some additional information must be added to the file for housekeeping:
- header:
a 32 byte header containing various pieces of information
- meta:
a variable length metadata section, may contain user data
- offsets:
a variable length section containing chunk offsets
- chunk:
the blosc chunk(s)
- checksum:
a checksum following each chunk, if desired
The layout of the file is then:
|-header-|-meta-|-offsets-|-chunk-|-checksum-|-chunk-|-checksum-|...|
Description of the header
The following 32 byte header is used for Bloscpack as of version 0.3.0. The design goals of the header format are to contain as much information as possible to achieve interesting things in the future and to be as general as possible such that the persistence layer of Blaze/BLZ can be implemented without modification of the header format.
The following ASCII representation shows the layout of the header:
|-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
| b   l   p   k | ^ | ^ | ^ | ^ |   chunk-size  |  last-chunk   |
                  |   |   |   |
      version ----+   |   |   |
      options --------+   |   |
     checksum ------------+   |
     typesize ----------------+

|-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
|            nchunks           |        max-app-chunks         |
The first 4 bytes are the magic string blpk. They are followed by 4 single-byte fields: version, options, checksum and typesize. This is followed by 4 bytes for the chunk-size, another 4 bytes for the last-chunk-size, 8 bytes for the number of chunks, nchunks, and lastly 8 bytes for the total number of chunks that can be appended to this file, max-app-chunks.
Effectively, storing the number of chunks as a signed 8 byte integer limits the number of chunks to 2**63-1 = 9223372036854775807, but this should not be relevant in practice, since, even with the moderate default value of 1MB for chunk-size, we can still store files as large as roughly 8YB (2**83 bytes) (!) Given that in 2012 the maximum size of a single file in the Zettabyte File System (ZFS) is 16EB, Bloscpack should be safe for a few more years.
Description of the header entries
All entries are little-endian.
- version:
(uint8) format version of the Bloscpack header, to ensure exceptions in case of forward incompatibilities.
- options:
(bitfield) A bitfield which allows for setting certain options in this file.
- bit 0 (0x01):
If the offsets to the chunks are present in this file.
- bit 1 (0x02):
If metadata is present in this file.
- checksum:
(uint8) The checksum used. The following checksums, all available in the Python standard library, are supported. The checksum is always computed on the compressed data and placed after the chunk.
- 0:
no checksum
- 1:
zlib.adler32
- 2:
zlib.crc32
- 3:
hashlib.md5
- 4:
hashlib.sha1
- 5:
hashlib.sha224
- 6:
hashlib.sha256
- 7:
hashlib.sha384
- 8:
hashlib.sha512
- typesize:
(uint8) The typesize of the data in the chunks. Currently, the typesize is assumed to be uniform. The space allocated is the same as in the Blosc header.
- chunk-size:
(int32) Denotes the chunk-size. Since the maximum buffer size of Blosc is 2GB having a signed 32 bit int is enough (2GB = 2**31 bytes). The special value of -1 denotes that the chunk-size is unknown or possibly non-uniform.
- last-chunk:
(int32) Denotes the size of the last chunk. As with the chunk-size an int32 is enough. Again, -1 denotes that this value is unknown.
- nchunks:
(int64) The total number of chunks used in the file. Given a chunk-size of one byte, the total number of chunks is 2**63. This amounts to a maximum file-size of 8EB (8EB = 2**63 bytes), which should be enough for the next couple of years. Again, -1 denotes that the number of chunks is unknown.
- max-app-chunks:
(int64) The maximum number of chunks that can be appended to this file, excluding nchunks. This is only useful if there is an offsets section and if nchunks is known (not -1), if either of these conditions do not apply this should be 0.
The overall file-size can be computed as chunk-size * (nchunks - 1) + last-chunk-size. In a streaming scenario -1 can be used as a placeholder, for example if the total number of chunks or the size of the last chunk is not known at the time the header is created.
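As a quick check against the verbose example above: 536870912 * (3 - 1) + 526258176 = 1600000000 bytes, which is exactly the input file size reported there.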
The following constraints exist on the header entries:
last-chunk must be less than or equal to chunk-size.
nchunks + max_app_chunks must be less than or equal to the maximum value of an int64.
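To make the byte layout concrete, the following is a minimal decoding sketch based purely on the description above (it is not Bloscpack's own parser); it uses the raw header printed in the --debug example earlier:

import struct

# Little-endian, 32 bytes: 4s magic, 4 x uint8 (version, options,
# checksum, typesize), 2 x int32 (chunk-size, last-chunk),
# 2 x int64 (nchunks, max-app-chunks).
BLOSCPACK_HEADER = struct.Struct('<4sBBBBiiqq')

def decode_bloscpack_header(raw):
    (magic, version, options, checksum, typesize,
     chunk_size, last_chunk, nchunks, max_app_chunks) = \
        BLOSCPACK_HEADER.unpack(raw)
    assert magic == b'blpk'
    return {'version': version,
            'offsets': bool(options & 0x01),   # bit 0 of the options bitfield
            'metadata': bool(options & 0x02),  # bit 1 of the options bitfield
            'checksum': checksum,
            'typesize': typesize,
            'chunk_size': chunk_size,
            'last_chunk': last_chunk,
            'nchunks': nchunks,
            'max_app_chunks': max_app_chunks}

# The raw_bloscpack_header from the --debug example decodes to:
# version=3, offsets=True, metadata=False, checksum=1 (adler32),
# typesize=8, chunk_size=536870912, last_chunk=526258176,
# nchunks=3, max_app_chunks=30
raw = (b'blpk\x03\x01\x01\x08\x00\x00\x00 \x00\x10^\x1f'
       b'\x03\x00\x00\x00\x00\x00\x00\x00\x1e\x00\x00\x00\x00\x00\x00\x00')
print(decode_bloscpack_header(raw))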
Description of the metadata section
This section goes after the header. It consists of a metadata-section header followed by a serialized and potentially compressed data section, followed by preallocated space to resize the data section, possibly followed by a checksum.
The layout of the section is thus:
|-metadata-header-|-data-|-prealloc-|-checksum-|
The header has the following layout:
|-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
|          magic-format         | ^ | ^ | ^ | ^ |   meta-size   |
                                  |   |   |   |
       meta-options --------------+   |   |   |
       meta-checksum -----------------+   |   |
       meta-codec ------------------------+   |
       meta-level ----------------------------+

|-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
| max-meta-size |meta-comp-size |           user-codec          |
- magic-format:
(8 byte ASCII string) The data will usually be some kind of binary serialized string data, for example JSON, BSON, YAML or Protocol-Buffers. The format identifier is to be placed in this field.
- meta-options:
(bitfield) A bitfield which allows for setting certain options in this metadata section. Currently unused
- meta-checksum:
The checksum used for the metadata. The same checksums as for the data are available.
- meta-codec:
(uint8) The codec used for compressing the metadata. As of Bloscpack version 0.3.0 the following codecs are supported.
- 0:
no codec
- 1:
zlib (DEFLATE)
- meta-level:
(uint8) The compression level used for the codec. If codec is 0, i.e. the metadata is not compressed, this must be 0 too.
- meta-size:
(uint32) The size of the uncompressed metadata.
- max-meta-size:
(uint32) The total allocated space for the data section.
- meta-comp-size:
(uint32) If the metadata is compressed, this gives the total space the metadata occupies. If the data is not compressed this is the same as meta-size. In a sense this is the true amount of space in the metadata section that is used.
- user-codec:
Space reserved for usage of additional codecs. E.g. 4 byte magic string for codec identification and 4 bytes for encoding of codec parameters.
The total space left for enlarging the metadata section is simply: max-meta-size - meta-comp-size.
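For the metadata example shown by blpk info above, this comes out to 590 - 58 = 532 bytes of room left for the metadata to grow into.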
JSON Example of serialized metadata:
'{"dtype": "float64", "shape": [1024], "others": []}'
If compression is requested, but not beneficial, because the compressed size would be larger than the uncompressed size, compression of the metadata is automatically deactivated.
As of Bloscpack version 0.3.0 only the JSON serializer is supported, using the string JSON followed by four whitespace bytes as identifier. Since JSON, like any of the other suggested serializers, has limitations, only a subset of Python structures can be stored, so some additional object handling may be needed prior to serializing certain kinds of metadata.
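Analogously to the Bloscpack header, the 32 byte metadata-section header described above could be decoded like this (again an illustrative sketch based on the layout above, not Bloscpack's own code):

import struct

# Little-endian, 32 bytes: 8s magic-format, 4 x uint8 (meta-options,
# meta-checksum, meta-codec, meta-level), 3 x uint32 (meta-size,
# max-meta-size, meta-comp-size), 8s user-codec.
METADATA_HEADER = struct.Struct('<8sBBBBIII8s')

def decode_metadata_header(raw):
    (magic_format, meta_options, meta_checksum, meta_codec, meta_level,
     meta_size, max_meta_size, meta_comp_size, user_codec) = \
        METADATA_HEADER.unpack(raw)
    return {'magic_format': magic_format.rstrip(b' '),  # e.g. 'JSON' + 4 spaces
            'meta_options': meta_options,
            'meta_checksum': meta_checksum,
            'meta_codec': meta_codec,
            'meta_level': meta_level,
            'meta_size': meta_size,
            'max_meta_size': max_meta_size,
            'meta_comp_size': meta_comp_size,
            'user_codec': user_codec}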
Description of the offsets entries
Following the metadata section comes a variable length section of chunk offsets, which are used for accelerated seeking; the offsets (if activated) thus follow the header and the metadata section. Each offset is a 64 bit signed little-endian integer (int64). A value of -1 denotes an unknown offset. Initially, all offsets should be initialized to -1 and filled in after writing all chunks. Thus, if the compression of the file fails prematurely or is aborted, all offsets should have the value -1. Also, any unused offset entries preallocated to allow the file to grow should be set to -1. Each offset denotes the exact position of the chunk in the file, such that seeking to the offset positions the file pointer so that reading the next 16 bytes gives the Blosc header, which is at the start of the desired chunk.
Description of the chunk format
As mentioned previously, each chunk is just a Blosc compressed string including header. The Blosc header (as of v1.0.0) is 16 bytes as follows:
|-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
  ^   ^   ^   ^ |     nbytes    |   blocksize   |    ctbytes    |
  |   |   |   |
  |   |   |   +--typesize
  |   |   +------flags
  |   +----------versionlz
  +--------------version
The first four entries are single bytes; the last three are unsigned ints (uint32), each occupying 4 bytes. The header is always little-endian. ctbytes is the length of the compressed buffer including this header and nbytes is the length of the data when uncompressed. A more detailed description of the Blosc header can be found in the README_HEADER.rst of the Blosc repository.
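Given an offset from the offsets section, random access to a single chunk could then look roughly like the following sketch (assumptions: the file object is opened in binary mode, the offset comes from the offsets section described above, and any trailing checksum is simply ignored rather than verified):

import struct
import blosc  # the python-blosc bindings

def read_chunk(f, offset):
    # Seek to the chunk and read its 16 byte Blosc header to learn the
    # total compressed size (ctbytes lives in bytes 12-15).
    f.seek(offset)
    blosc_header = f.read(16)
    ctbytes = struct.unpack('<I', blosc_header[12:16])[0]
    # Re-read the whole chunk (header included) and decompress it.
    f.seek(offset)
    return blosc.decompress(f.read(ctbytes))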
Overhead
Depending on which configuration is used, a constant or linear overhead is added to the file. The Bloscpack header adds 32 bytes in any case. If the data is non-compressible, Blosc will add 16 bytes of header to each chunk. The metadata section obviously adds a constant overhead and, if used, both the checksum and the offsets add overhead to the file. The offsets add 8 bytes per chunk and the checksum adds a fixed number of bytes, depending on the checksum, to each chunk; for example, 4 bytes (32 bits) for the adler32 checksum.
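As a worked example using the three-chunk --debug run above: the Bloscpack header accounts for 32 bytes and the preallocated offsets section for (3 + 30) * 8 = 264 bytes, which is why the first offset printed by blpk info is 296; the three adler32 checksums add another 3 * 4 = 12 bytes on top of the compressed chunks.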
Coding Conventions
Numpy rst style docstrings
README cli examples should use long options
testing: expected before received nt.assert_equal(expected, received)
Debug messages: as close to where the data was generated
Single quotes around ambiguities in messages, e.g. when overwriting an existing file: 'testfile'
Exceptions instead of exit
nose test generators for parameterized tests
Use the Wikipedia definition of compression ratio: http://en.wikipedia.org/wiki/Data_compression_ratio
How to Optimize Logging
Some care must be taken when logging in the inner loop. For example consider the following two commits:
https://github.com/Blosc/bloscpack/commit/0854930514eebaf7dbc6c4dcf3589dbcb9f2fdc9
https://github.com/Blosc/bloscpack/commit/355bf90a8c13a2a1f792d43228c2a68c61476621
If there is a large number of chunks, calls to double_pretty_size will be executed (and may be costly) even if no logging output is needed.
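The gist of those commits is to avoid doing the expensive pretty-printing work unless the message will actually be emitted. A minimal sketch of the pattern (illustrative only; the names and the use of the standard logging module here are not Bloscpack's actual implementation):

import logging

log = logging.getLogger('blpk')

def double_pretty_size(nbytes):
    # Stand-in for the real helper; imagine this being costly.
    return '%.2fM (%dB)' % (nbytes / 2.0 ** 20, nbytes)

def handle_chunk(i, nbytes):
    # Naive: the formatting runs for every chunk, even when debug
    # output is switched off:
    #   log.debug('chunk %d: %s' % (i, double_pretty_size(nbytes)))
    # Better: only do the costly call when the message will be shown.
    if log.isEnabledFor(logging.DEBUG):
        log.debug('chunk %d: %s', i, double_pretty_size(nbytes))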
Consider the following script, loop-bench.py:
import numpy as np
import bloscpack as bp
import blosc
shuffle = True
clevel = 9
cname = 'lz4'
a = np.arange(2.5e8)
bargs = bp.args.BloscArgs(clevel=clevel, shuffle=shuffle, cname=cname)
bpargs = bp.BloscpackArgs(offsets=False, checksum='None', max_app_chunks=0)
Timing with v0.7.0:
In [1]: %run loop-bench.py
In [2]: %timeit bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs)
1 loops, best of 3: 423 ms per loop
In [3]: %timeit bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs)
1 loops, best of 3: 421 ms per loop
In [4]: bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs)
In [5]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 727 ms per loop
In [6]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 725 ms per loop
And then using a development version that contains the two optimization commits:
In [1]: %run loop-bench.py
In [2]: %timeit bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs)
1 loops, best of 3: 357 ms per loop
In [3]: %timeit bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs)
1 loops, best of 3: 357 ms per loop
In [4]: bpc = bp.pack_ndarray_str(a, blosc_args=bargs, bloscpack_args=bpargs)
In [5]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 658 ms per loop
In [6]: %timeit a3 = bp.unpack_ndarray_str(bpc)
1 loops, best of 3: 655 ms per loop
Comparison to HDF5/PyTables
Since Blosc is already supported for use in HDF5 files via PyTables, one might be tempted to question why yet another file format has to be invented. This section aims to differentiate between Bloscpack and HDF5/PyTables and effectively argues that they are not competitors.
Lightweight vs. Heavyweight. Bloscpack is a lightweight format. The format specification can easily be digested within a day and the dependencies are minimal. PyTables is a complex piece of software and the HDF5 file format specification is a large document.
Persistence vs. Database. Bloscpack is designed to allow for fast serialization and deserialization of in-memory data. PyTables is more of a database which for example allows complex queries to be computed on the data.
Additionally there are two network use cases which Bloscpack is suited for (but does not have support for as of yet):
Streaming: Since Bloscpack without offsets can be written in a single pass, it is ideally suited for streaming over a network, where you can compress, send and decompress individual chunks in a streaming fashion.
Expose a file over HTTP and do partial reads from it, for example when storing a compressed file in S3. You can easily just store a file on a web server and then use the header information to read and decompress individual chunks.
Prior Art
The following is a list of important resources that were read during the conception and initial stages of Bloscpack.
The 6pack utility included with FastLZ (the codec that BloscLZ was derived from) was the initial inspiration for writing a command line interface to Blosc.
The Wikipedia article on the PNG format contains some interesting details about the PNG header and file headers in general.
The XZ File Format Specification gave rise to some ideas and techniques about writing file format specifications and using checksums for data integrity, although the format and the document itself were a bit too heavyweight for my tastes.
The Snappy framing format and the file container format for LZ4 were also consulted, but I can’t remember if and what inspiration they gave rise to.
The homepages of zlib and gzip were also consulted at some point. The command line interface of gzip/gunzip was deemed to be from a different era and as a result git-style subcommands are used in Bloscpack.
Maintainers Notes on Cutting a Release
Set the version as environment variable VERSION=vX.X.X
Update the changelog
Commit using git commit -m "$VERSION changelog"
Set the version number in bloscpack/version.py
Commit with git commit -m "$VERSION"
Make the tag using git tag -s -m "Bloscpack $VERSION" $VERSION
Push commits to Blosc github git push blosc master
Push commits to own github git push esc master
Push the tag to Blosc github git push blosc $VERSION
Push the tag to own github git push esc $VERSION
Upload to PyPi using python setup.py sdist upload
Bump version number to next dev version
Announce release on the Blosc list
Announce release via Twitter
TODO
Documentation
Refactor monolithic readme into Sphinx and publish
Write the docstrings for the Args classes
Cleanup and double check the docstrings for the public API classes
document library usage
Announcement RST
Command Line
quiet verbosity level
Expose the ability to set ‘max_app_chunks’ from the command line
Allow to save metadata to a file during decompression
subcommand e or estimate to estimate the size of the uncompressed data.
subcommand v or verify to verify the integrity of the data
add --raw-input and --raw-output switches to allow stuff like: cat file | blpk --raw-input --raw-output compress > file.blp
Establish and document proper exit codes
Document the metadata saved during Numpy serialization
Profiling and Optimization
Use the faster version of struct where you have a single string
Memory profiler, might be able to reduce memory used by reusing the buffer during compression and decompression
Benchmark different codecs
Use line profiler to check code
Select different defaults for Numpy arrays, no offsets? no pre-alloc?
Library Features
possibly provide a BloscPackFile abstraction, like GzipFile
Allow to not-prealloc additional space for metadata
Refactor certain collections of functions that operate on data into objects
Offsets (maybe)
partial decompression?
since we now have potentially small chunks, the progressbar becomes relevant again
configuration file to store commonly used options on a given machine
print the compression time, either as verbose or debug
Investigate if we can use a StringIO object that returns memoryviews on read.
Implement a memoryview Compressed/PlainSource
Use a bytearray to read chunks from a file. Then re-use that bytearray during every read to avoid allocating deallocating strings the whole time.
The keyword arguments to many functions are global dicts, which is a bad idea; make them immutable with a frozendict.
Check that the checksum is really being checked for all PlainSinks
Bunch of NetworkSource/Sinks
HTTPSource/Sink
Miscellaneous
check Python 3.x compatibility
Announce on scipy/numpy lists, comp.compression, freshmeat, ohloh …
Packaging and Infrastructure
Debian packages (for python-blosc and bloscpack)
Conda recipes (for python-blosc and bloscpack)
Use tox for testing multiple python versions
Build on travis and drone.io using pre-compiled
Changelog
v0.7.2 - Wed Mar 25 2015
Fix support for zero length arrays (and input in general) (#17 reported by @dmbelov)
Catch when typesize doesn’t divide chunk_size (#18 reported by @dmbelov)
Fix serialization of object arrays (#16 reported by @dmbelov)
Reject Object dtype arrays since they cannot be compressed with Bloscpack
Provide backwards compatibility for older Numpy serializations
Fix win32 compatibility of tests (#27 fixed by @mindw)
Fix using setuptools for scripts and dependencies (#28 fixed by @mindw)
Various misc fixes
v0.7.1 - Sun Jun 29 2014
Fix a bug related to setting the correct typesize when compressing Numpy arrays
Optimization of debug statements in the inner loops
v0.7.0 - Wed May 28 2014
Modularize cram tests, even has something akin to a harness
Refactored, tweaked and simplified Source/Sink code and semantics
Various documentation improvements: listing prior art, comparison to HDF5
Improve benchmarking scripts
Introduce a BloscArgs object for saner handling of the BloscArgs
Introduce a BloscpackArgs object for saner handling of the BloscpackArgs
Introduce MetadataHeader and MetadataArgs objects too
Fix all (hopefully) incorrect uses of the term ‘compression ratio’
Various miscellaneous fixes and improvements
v0.6.0 - Fri Mar 28 2014
Complete refactor of Bloscpack codebase to support modularization
Support for drone.io CI service
Improved dependency specification for Python 2.6
Improved installation instructions
v0.5.2 - Fri Mar 07 2014
Fix project url in setup.py
v0.5.1 - Sat Feb 22 2014
Documentation fixes and improvements
v0.5.0 - Sun Feb 02 2014
Moved project to the Blosc organization on Github
v0.5.0-rc1 - Thu Jan 30 2014
Support for Blosc 1.3.x (alternative codecs)
v0.4.1 - Fri Sep 27 2013
Fixed the pack_unpack_hard test suite
Fixed handling Numpy record and nested record arrays
v0.4.0 - Sun Sep 15 2013
Fix a bug when serializing numpy arrays to strings
v0.4.0-rc2 - Tue Sep 03 2013
Package available via PyPi (since 0.4.0-rc1)
Support for packing/unpacking numpy arrays to/from string
Check that string and record arrays work
Fix installation problems with PyPi package (Thanks to Olivier Grisel)
v0.4.0-rc1 - Sun Aug 18 2013
BloscpackHeader class introduced
The info subcommand shows human readable sizes when printing the header
Now using Travis-CI for testing and Coveralls for coverage
Further work on the Plain/Compressed-Source/Sink abstractions
Start using memoryview in places
Learned to serialize Numpy arrays
v0.3.0 - Sun Aug 04 2013
Minor readme fixes
Increase number of cram tests
v0.3.0-rc1 - Thu Aug 01 2013
Bloscpack format changes (format version 3)
Variable length metadata section with its own header
Ability to preallocate offsets for appending data (max_app_chunks)
Refactor compression and decompression to use file pointers instead of file name strings, allows using StringIO/cStringIO.
Sanitize calculation of nchunks and chunk-size
Special keyword max for use with chunk-size in the CLI
Support appending to a file and append subcommand (including the ability to preallocate offsets)
Support rudimentary info subcommand
Add tests of the command line interface using cram
Minor bugfixes and corrections as usual
v0.2.1 - Mon Nov 26 2012
Backport to Python 2.6
Typo fixes in documentation
v0.2.0 - Fri Sep 21 2012
Use atexit magic to remove test data on abort
Change prefix of temp directory to /tmp/blpk*
Merge header RFC into monolithic readme
v0.2.0-rc2 - Tue Sep 18 2012
Don’t bail out if the file is smaller than default chunk
Set the default typesize to 8 bytes
Upgrade dependencies to python-blosc v1.0.5 and fix tests
Make extreme test less resource intensive
Minor bugfixes and corrections
v0.2.0-rc1 - Thu Sep 13 2012
Implement new header format as described in RFC
Implement checksumming compressed chunks with various checksums
Implement offsets of the chunks into the file
Efforts to make the library re-entrant, better control of side-effects
README is now rst not md (flirting with sphinx)
Tons of trivial fixes, typos, wording, refactoring, renaming, pep8 etc..
v0.1.1 - Sun Jul 15 2012
Fix the memory issue with the tests
Two new suites: hard and extreme
Minor typo fixes and corrections
v0.1.0 - Thu Jun 14 2012
Freeze the first 8 bytes of the header (hopefully for ever)
Fail to decompress on non-matching format version
Minor typo fixes and corrections
v0.1.0-rc3 - Tue Jun 12 2012
Limit the chunk-size benchmark to a narrower range
After more careful experiments, a default chunk-size of 1MB was deemed most appropriate
Fixed a terrible bug, where during testing and benchmarking, temporary files were not removed, oups…
Adapted the header to have space for more chunks, include special marker for unknown chunk number (-1) and format version of the compressed file
Added a note in the README about instability of the file format
Various minor fixes and enhancements
v0.1.0-rc2 - Sat Jun 09 2012
Default chunk-size now 4MB
Human readable chunk-size argument
Last chunk now contains remainder
Pure python benchmark to compare against gzip
Benchmark to measure the effect of chunk-size
Various minor fixes and enhancements
v0.1.0-rc1 - Sun May 27 2012
Initial version
Compression/decompression
Command line argument parser
README, setup.py, tests and benchmark
Thanks
Francesc Alted for writing Blosc in the first place, for providing continual code-review and feedback on Bloscpack and for co-authoring the Bloscpack file-format specification.