simdjson bindings for python
pysimdjson
Quick-n'-dirty Python bindings for simdjson, written to see whether this approach yields parse-time improvements in real-world applications. So far the results are promising, especially when only part of a document is of interest.
Bindings are currently tested on OS X, Linux, and Windows.
See the latest documentation at http://pysimdjson.tkte.ch.
Installation
There are binary wheels available for py3.6/py3.7 on OS X 10.12 & Windows. On other platforms you'll need a C++17-capable compiler.
pip install pysimdjson
If you're getting errors when installing from pip, there's probably no binary package available for your combination of platform & python version. As long as you have a C++17 compiler installed you can still use pip; you just need to provide a few extra compiler flags. The most common are:
- gcc/clang:
  CFLAGS="-march=native -std=c++17" pip install pysimdjson
- msvc (Visual Studio 2017), setting the flags first and then running pip as a separate command:
  SET CL=/std:c++17 /arch:AVX2
  pip install pysimdjson
or from git:
git clone https://github.com/TkTech/pysimdjson.git
cd pysimdjson
python setup.py install
Example
import simdjson

with open('sample.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())
However, this doesn't really gain you that much over, say, ujson. You're still loading the entire document and converting the entire thing into a series of Python objects, which is very expensive. You can instead use items() to pull only part of a document into Python.
Example document:
{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}
And now let's try some queries...
import simdjson

with open('sample.json', 'rb') as fin:
    # Calling ParsedJson with a document is a shortcut for
    # calling pj.allocate_capacity(<size>) and pj.parse(<doc>). If you're
    # parsing many JSON documents of similar sizes, you can allocate
    # a large buffer just once and keep re-using it instead.
    pj = simdjson.ParsedJson(fin.read())

    pj.items('.type') #> "search_results"
    pj.items('.count') #> 2
    pj.items('.results[].username') #> ["bob", "tod"]
    pj.items('.error.message') #> "All good captain"
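To make the query syntax concrete, here is a minimal pure-Python sketch of how the paths above map onto the example document. It uses the standard-library json module and a hypothetical helper, query(), which is not part of pysimdjson. It builds the whole document first, so it only illustrates what `.key` and `[]` select, not the performance benefit of items(), and it assumes only the path forms shown above.

```python
import json
import re

SAMPLE = '''{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}'''

# A path step is either ".name" (object key) or "[]" (every array element).
_STEP = re.compile(r'\.([^.\[\]]+)|(\[\])')


def query(doc, path):
    """Hypothetical helper: evaluate a '.key' / '[]' path on parsed JSON."""
    steps = [(m.group(1), m.group(2)) for m in _STEP.finditer(path)]

    def walk(node, remaining):
        if not remaining:
            return node
        (key, wildcard), rest = remaining[0], remaining[1:]
        if wildcard:
            # "[]" fans out over every element of the array.
            return [walk(item, rest) for item in node]
        return walk(node[key], rest)

    return walk(doc, steps)


doc = json.loads(SAMPLE)
print(query(doc, '.type'))
print(query(doc, '.results[].username'))
```

The `[]` step is what turns a single path into a list of results: every step after it is applied to each array element in turn.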
AVX2
simdjson requires AVX2 support to function. Check to see if your OS/processor supports it:
- OS X:
sysctl -a | grep machdep.cpu.leaf7_features
- Linux:
grep avx2 /proc/cpuinfo
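The Linux check can also be done from Python, for example before deciding whether to import simdjson. This is a rough sketch: cpuinfo_has_avx2() is a hypothetical helper, not part of pysimdjson, and it only understands the "flags" lines that x86 Linux kernels write to /proc/cpuinfo.

```python
from pathlib import Path


def cpuinfo_has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists avx2."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(':')
        if name.strip() == 'flags' and 'avx2' in value.split():
            return True
    return False


if __name__ == '__main__':
    cpuinfo = Path('/proc/cpuinfo')
    if cpuinfo.exists():
        print('AVX2 supported:', cpuinfo_has_avx2(cpuinfo.read_text()))
    else:
        print('No /proc/cpuinfo here; use the platform command instead.')
```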
Low-level interface
You can use the low-level simdjson Iterator interface directly, but be aware that this interface can change at any time. If you depend on it, you should pin to a specific version of simdjson. You may need to use this interface if you're dealing with odd JSON, such as a document with repeated keys.
import simdjson

with open('sample.json', 'rb') as fin:
    pj = simdjson.ParsedJson(fin.read())
    iter = simdjson.Iterator(pj)
    # Step into the top-level object and print the first string found.
    if iter.is_object():
        if iter.down():
            print(iter.get_string())
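To see why repeated keys count as "odd" JSON, consider what a dict-based loader does with them. The sketch below uses only the standard-library json module (not pysimdjson) and a made-up document: loading into a dict silently keeps the last value, while object_pairs_hook exposes every pair in order. That loss of information is what a lower-level, iterator-style interface lets you avoid.

```python
import json

# A document that repeats the key "id" inside one object.
ODD = '{"id": 1, "id": 2, "name": "bob"}'

# Loading into a dict silently keeps only the last value for "id".
as_dict = json.loads(ODD)
print(as_dict)

# object_pairs_hook receives every key/value pair in document order,
# so returning the list unchanged preserves the repeated key.
as_pairs = json.loads(ODD, object_pairs_hook=list)
print(as_pairs)
```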
Early Benchmark
Comparing the built-in json module loads on py3.7 to simdjson loads.
File | json time | pysimdjson time |
---|---|---|
jsonexamples/apache_builds.json | 0.09916733999999999 | 0.074089268 |
jsonexamples/canada.json | 5.305393378 | 1.6547515810000002 |
jsonexamples/citm_catalog.json | 1.3718639709999998 | 1.0438697340000003 |
jsonexamples/github_events.json | 0.04840242700000097 | 0.034239397999998644 |
jsonexamples/gsoc-2018.json | 1.5382746889999996 | 0.9597240750000005 |
jsonexamples/instruments.json | 0.24350973299999978 | 0.13639699600000021 |
jsonexamples/marine_ik.json | 4.505123285000002 | 2.8965093270000004 |
jsonexamples/mesh.json | 1.0325923849999974 | 0.38916503499999777 |
jsonexamples/mesh.pretty.json | 1.7129034710000006 | 0.46509220500000126 |
jsonexamples/numbers.json | 0.16577519699999854 | 0.04843887400000213 |
jsonexamples/random.json | 0.6930746310000018 | 0.6175370539999996 |
jsonexamples/twitter.json | 0.6069602610000011 | 0.41049074900000093 |
jsonexamples/twitterescaped.json | 0.7587005720000022 | 0.41576198399999953 |
jsonexamples/update-center.json | 0.5577604210000011 | 0.4961777420000004 |
Getting subsets of the document is significantly faster. For canada.json, getting .type using the naive approach and the items() approach, averaged over N=100:
Python | Time |
---|---|
json.loads(canada_json)['type'] | 5.76244878 |
simdjson.loads(canada_json)['type'] | 1.5984486990000004 |
simdjson.ParsedJson(canada_json).items('.type') | 0.3949587819999998 |
This approach avoids creating Python objects for fields that aren't of interest. When you only care about a small part of the document, it should be faster, as the table above shows for canada.json.
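The numbers above were not published with the script that produced them, so the following is only a sketch of one way to collect comparable timings with the standard-library timeit module. The helper time_loader(), the run count of 100, and the small inline payload are all assumptions made for illustration, and the helper reports total rather than per-call time.

```python
import json
import timeit


def time_loader(loader, payload, number=100):
    """Return the total seconds taken to call loader(payload) `number` times."""
    return timeit.timeit(lambda: loader(payload), number=number)


# Small inline stand-in; swap in the contents of a real file such as
# jsonexamples/canada.json to get meaningful numbers.
payload = json.dumps({'results': [{'username': 'user%d' % i} for i in range(1000)]})

loaders = {
    'json.loads': json.loads,
    # With pysimdjson installed, add: 'simdjson.loads': simdjson.loads
}

timings = {name: time_loader(fn, payload) for name, fn in loaders.items()}
for name, seconds in timings.items():
    print('{:<16} {:.6f}'.format(name, seconds))
```

Timings from a harness like this vary with the machine, the Python build, and the size of the document, so compare loaders only within a single run.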