
simdjson bindings for python


pysimdjson

Quick-n'dirty Python bindings for simdjson just to see if going down this path might yield some parse time improvements in real-world applications. So far, the results are promising, especially when only part of a document is of interest.

Bindings are currently tested on OS X, Linux, and Windows.

See the latest documentation at http://pysimdjson.tkte.ch.

Installation

There are binary wheels available for py3.6/py3.7 on OS X 10.12 & Windows. On other platforms you'll need a C++17-capable compiler.

pip install pysimdjson

If you're getting errors when installing from pip, there's probably no binary package available for your combination of platform & Python version. As long as you have a C++17 compiler installed you can still use pip, but you'll need to provide a few extra compiler flags. The most common are:

  • gcc/clang: CFLAGS="-march=native -std=c++17" pip install pysimdjson

  • msvc (Visual Studio 2017):

    SET CL="/std:c++17 /arch:AVX2"
    pip install pysimdjson
    

or from git:

git clone https://github.com/TkTech/pysimdjson.git
cd pysimdjson
python setup.py install

Example

import simdjson

with open('sample.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())

However, this doesn't really gain you that much over, say, ujson. You're still loading the entire document and converting the entire thing into a series of Python objects which is very expensive. You can instead use items() to pull only part of a document into Python.

Example document:

{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}

And now let's try some queries...

import simdjson

with open('sample.json', 'rb') as fin:
    # Calling ParsedJson with a document is a shortcut for
    # calling pj.allocate_capacity(<size>) and pj.parse(<doc>). If you're
    # parsing many JSON documents of similar sizes, you can allocate
    # a large buffer just once and keep re-using it instead.
    pj = simdjson.ParsedJson(fin.read())

    pj.items('.type') #> "search_results"
    pj.items('.count') #> 2
    pj.items('.results[].username') #> ["bob", "tod"]
    pj.items('.error.message') #> "All good captain"
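
To make the query strings above concrete, here is a rough pure-Python sketch of what they select, written against the standard json module. It is only an illustration of the results shown in the comments above, not how pysimdjson implements items(): the query() helper is hypothetical, handles only '.key' steps and '[]', and builds the whole document in Python first, which is exactly the cost items() is meant to avoid.

```python
import json

SAMPLE = json.loads('''
{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}
''')


def query(doc, path):
    """Hypothetical helper: resolve '.key' steps and '[]' against a
    plain Python object built by json.loads."""
    current = [doc]   # candidate values reached so far
    mapped = False    # True once a '[]' step has fanned out over a list
    for step in path.strip('.').split('.'):
        fan_out = step.endswith('[]')
        key = step[:-2] if fan_out else step
        current = [item[key] for item in current]
        if fan_out:
            # Replace each list with its elements (flatten one level).
            current = [element for item in current for element in item]
            mapped = True
    return current if mapped else current[0]


usernames = query(SAMPLE, '.results[].username')
```

With the example document, query(SAMPLE, '.results[].username') gives the same list of usernames shown in the comments above.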

AVX2

simdjson requires AVX2 support to function. Check to see if your OS/processor supports it:

  • OS X: sysctl -a | grep machdep.cpu.leaf7_features
  • Linux: grep avx2 /proc/cpuinfo
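
If you'd rather do the Linux check from Python, for example before deciding whether to import simdjson at all, a minimal sketch could look like the following. The function names are made up for this example, and it assumes the x86 /proc/cpuinfo layout where CPU features appear on "flags" lines.

```python
import os


def cpuinfo_has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(':')
        if name.strip() == 'flags' and 'avx2' in value.split():
            return True
    return False


def linux_has_avx2(path='/proc/cpuinfo'):
    """Linux only: returns None when the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path) as fin:
        return cpuinfo_has_avx2(fin.read())
```

A None result means the check could not be performed (for example on OS X or Windows), not that AVX2 is missing.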

Low-level interface

You can use the low-level simdjson Iterator interface directly, but be aware that this interface can change at any time. If you depend on it, you should pin to a specific version of simdjson. You may need to use this interface if you're dealing with odd JSON, such as a document with repeated keys.

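import simdjson

with open('sample.json', 'rb') as fin:
    pj = simdjson.ParsedJson(fin.read())
    # Walk the parsed document with the low-level iterator.
    iter = simdjson.Iterator(pj)
    if iter.is_object():
        # Step down into the object before reading a string from it.
        if iter.down():
            print(iter.get_string())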
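
To see why repeated keys are awkward for any dict-based result, here is a small illustration using only the standard library json module (not pysimdjson). A Python dict can hold each key once, so duplicates are silently collapsed unless you ask for the raw key/value pairs.

```python
import json

raw = '{"id": 1, "id": 2}'

# The default decoder builds a dict, so the last duplicate wins.
as_dict = json.loads(raw)

# object_pairs_hook receives every (key, value) pair in document order.
as_pairs = json.loads(raw, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```

Walking the document element by element, as the Iterator example does, is another way to see every occurrence of a key.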

Early Benchmark

Comparing loads from the built-in json module on py3.7 to loads from simdjson:

File                                json time              pysimdjson time
jsonexamples/apache_builds.json     0.09916733999999999    0.074089268
jsonexamples/canada.json            5.305393378            1.6547515810000002
jsonexamples/citm_catalog.json      1.3718639709999998     1.0438697340000003
jsonexamples/github_events.json     0.04840242700000097    0.034239397999998644
jsonexamples/gsoc-2018.json         1.5382746889999996     0.9597240750000005
jsonexamples/instruments.json       0.24350973299999978    0.13639699600000021
jsonexamples/marine_ik.json         4.505123285000002      2.8965093270000004
jsonexamples/mesh.json              1.0325923849999974     0.38916503499999777
jsonexamples/mesh.pretty.json       1.7129034710000006     0.46509220500000126
jsonexamples/numbers.json           0.16577519699999854    0.04843887400000213
jsonexamples/random.json            0.6930746310000018     0.6175370539999996
jsonexamples/twitter.json           0.6069602610000011     0.41049074900000093
jsonexamples/twitterescaped.json    0.7587005720000022     0.41576198399999953
jsonexamples/update-center.json     0.5577604210000011     0.4961777420000004

Getting subsets of the document is significantly faster. For canada.json, getting .type using the naive approach and the items() approach, averaged over N=100:

Python                                             Time
json.loads(canada_json)['type']                    5.76244878
simdjson.loads(canada_json)['type']                1.5984486990000004
simdjson.ParsedJson(canada_json).items('.type')    0.3949587819999998

This approach avoids creating Python objects for fields that aren't of interest. When you only care about a small part of the document, it will always be faster.
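
If you want to run a similar comparison on your own documents, a minimal timing sketch with the standard timeit module might look like this. The harness that produced the tables above isn't shown, so this is an assumption about the general approach rather than a reproduction of it; average_seconds() and the stand-in payload are made up for the example, and simdjson is only timed when it can be imported.

```python
import json
import timeit


def average_seconds(func, payload, repeat=100):
    """Average wall-clock seconds per call of func(payload) over `repeat` calls."""
    return timeit.timeit(lambda: func(payload), number=repeat) / repeat


# Hypothetical stand-in payload; substitute the bytes of a real document.
payload = json.dumps(
    {"type": "search_results", "results": [{"n": i} for i in range(1000)]}
).encode()

timings = {"json.loads": average_seconds(json.loads, payload)}

try:
    import simdjson  # optional: only timed when pysimdjson is installed
except ImportError:
    simdjson = None

if simdjson is not None and hasattr(simdjson, "loads"):
    timings["simdjson.loads"] = average_seconds(simdjson.loads, payload)

for name, seconds in timings.items():
    print(name, seconds)
```

Timings from a small synthetic payload like this one won't match the figures above; the point is only the shape of the comparison.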


