simdjson bindings for Python

pysimdjson

Quick-and-dirty Python bindings for simdjson, written to see whether going down this path yields parse-time improvements in real-world applications. So far the results are promising, especially when only part of a document is of interest.

These bindings are currently tested only on OS X and Windows. They should work everywhere simdjson does, although you will probably have to tweak your build flags.

See the latest documentation at http://pysimdjson.tkte.ch.

Installation

Binary wheels are available for Python 3.6 and 3.7 on OS X 10.12 and Windows. On other platforms you'll need a C++17-capable compiler.

pip install pysimdjson

or from source:

git clone https://github.com/TkTech/pysimdjson.git
cd pysimdjson
python setup.py install

Example

import simdjson

with open('sample.json', 'rb') as fin:
    doc = simdjson.loads(fin.read())

However, this doesn't gain you much over, say, ujson. You are still loading the entire document and converting all of it into Python objects, which is very expensive. Instead, you can use items() to pull only part of a document into Python.

Example document:

{
    "type": "search_results",
    "count": 2,
    "results": [
        {"username": "bob"},
        {"username": "tod"}
    ],
    "error": {
        "message": "All good captain"
    }
}

And now let's try some queries:

import simdjson

with open('sample.json', 'rb') as fin:
    # Calling ParsedJson with a document is a shortcut for
    # calling pj.allocate_capacity(<size>) and pj.parse(<doc>). If you're
    # parsing many JSON documents of similar sizes, you can allocate
    # a large buffer just once and keep re-using it instead.
    pj = simdjson.ParsedJson(fin.read())

    pj.items('.type') #> "search_results"
    pj.items('.count') #> 2
    pj.items('.results[].username') #> ["bob", "tod"]
    pj.items('.error.message') #> "All good captain"
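
To make the query strings above concrete, here is a minimal pure-Python sketch of what they select. It is not how pysimdjson implements items(): it parses the whole document with the built-in json module, which is exactly the cost items() is meant to avoid. The helper names query and _walk are hypothetical, and only the two forms used above (.key and .key[]) are handled.

```python
import json

SAMPLE = """
{
    "type": "search_results",
    "count": 2,
    "results": [{"username": "bob"}, {"username": "tod"}],
    "error": {"message": "All good captain"}
}
"""


def query(value, path):
    """Resolve a query such as '.results[].username' (hypothetical helper)."""
    # '.a.b[].c' becomes ['a', 'b[]', 'c']; the leading dot yields an empty
    # string that is filtered out.
    parts = [p for p in path.split('.') if p]
    return _walk(value, parts)


def _walk(value, parts):
    if not parts:
        return value
    head, rest = parts[0], parts[1:]
    if head.endswith('[]'):
        # 'name[]': look up 'name', then apply the remaining path to each element.
        return [_walk(item, rest) for item in value[head[:-2]]]
    return _walk(value[head], rest)


doc = json.loads(SAMPLE)
print(query(doc, '.type'))                # search_results
print(query(doc, '.results[].username'))  # ['bob', 'tod']
```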

AVX2

simdjson requires AVX2 support to function. Check to see if your OS/processor supports it:

  • OS X: sysctl -a | grep machdep.cpu.leaf7_features
  • Linux: grep avx2 /proc/cpuinfo
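
The Linux check above can also be done from Python before importing the bindings. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the x86 /proc/cpuinfo layout, where CPU features appear on lines starting with "flags".

```python
import platform
from pathlib import Path


def cpuinfo_has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        name, _, value = line.partition(':')
        if name.strip() == 'flags' and 'avx2' in value.split():
            return True
    return False


def linux_supports_avx2(path='/proc/cpuinfo'):
    """Best-effort check: True/False on Linux, None if it cannot be determined."""
    if platform.system() != 'Linux':
        return None
    try:
        return cpuinfo_has_avx2(Path(path).read_text())
    except OSError:
        return None


print(linux_supports_avx2())
```

A None result only means this sketch could not tell; on OS X, use the sysctl command listed above instead.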

Low-level interface

You can use the low-level simdjson Iterator interface directly, but be aware that this interface can change at any time. If you depend on it, you should pin to a specific version of simdjson. You may need this interface if you're dealing with odd JSON, such as a document with repeated keys.

import simdjson

with open('sample.json', 'rb') as fin:
    pj = simdjson.ParsedJson(fin.read())
    # Named `it` to avoid shadowing the built-in iter().
    it = simdjson.Iterator(pj)
    # Only descend if the document root is an object.
    if it.is_object():
        if it.down():
            print(it.get_string())
Early Benchmark

Comparing the built-in json module's loads on Python 3.7 to simdjson's loads.

File                              json time            pysimdjson time
jsonexamples/apache_builds.json   0.09916733999999999  0.074089268
jsonexamples/canada.json          5.305393378          1.6547515810000002
jsonexamples/citm_catalog.json    1.3718639709999998   1.0438697340000003
jsonexamples/github_events.json   0.04840242700000097  0.034239397999998644
jsonexamples/gsoc-2018.json       1.5382746889999996   0.9597240750000005
jsonexamples/instruments.json     0.24350973299999978  0.13639699600000021
jsonexamples/marine_ik.json       4.505123285000002    2.8965093270000004
jsonexamples/mesh.json            1.0325923849999974   0.38916503499999777
jsonexamples/mesh.pretty.json     1.7129034710000006   0.46509220500000126
jsonexamples/numbers.json         0.16577519699999854  0.04843887400000213
jsonexamples/random.json          0.6930746310000018   0.6175370539999996
jsonexamples/twitter.json         0.6069602610000011   0.41049074900000093
jsonexamples/twitterescaped.json  0.7587005720000022   0.41576198399999953
jsonexamples/update-center.json   0.5577604210000011   0.4961777420000004

Getting subsets of the document is significantly faster. For canada.json, getting .type using the naive approach and the items() approach, averaged over N=100:

Python                                           Time
json.loads(canada_json)['type']                  5.76244878
simdjson.loads(canada_json)['type']              1.5984486990000004
simdjson.ParsedJson(canada_json).items('.type')  0.3949587819999998

This approach avoids creating Python objects for fields that aren't of interest. When you only care about a small part of the document, it should be considerably faster.
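
A comparison like the one above can be reproduced with timeit. This is a minimal sketch, not the script used to produce the numbers in this README: the compare helper is hypothetical, it reports total seconds for all runs, and it skips the simdjson rows when the module (or its ParsedJson class) is not available.

```python
import json
import timeit

try:
    import simdjson  # optional; only timed when installed
except ImportError:
    simdjson = None


def compare(payload, number=100):
    """Time each approach `number` times and return total seconds per approach."""
    results = {
        "json.loads": timeit.timeit(
            lambda: json.loads(payload)['type'], number=number),
    }
    if simdjson is not None:
        results["simdjson.loads"] = timeit.timeit(
            lambda: simdjson.loads(payload)['type'], number=number)
        if hasattr(simdjson, 'ParsedJson'):
            results["ParsedJson.items"] = timeit.timeit(
                lambda: simdjson.ParsedJson(payload).items('.type'),
                number=number)
    return results


# A tiny stand-in payload; substitute the bytes of a real file such as
# jsonexamples/canada.json to get meaningful numbers.
payload = b'{"type": "FeatureCollection", "features": [1, 2, 3]}'
for name, seconds in compare(payload, number=10).items():
    print(name, seconds)
```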
