Fast C-based HTML 5 parsing for Python
Project description
A fast implementation of the HTML 5 parsing spec. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of those of html5lib, a speedup of roughly 30x.
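As a minimal usage sketch, assuming the package exposes a `parse` function that returns the root element of an lxml tree (consult the project's documentation for the exact signature and options), parsing a document and serializing the result might look like this. The helper name `parse_to_bytes` is hypothetical.

```python
def parse_to_bytes(markup):
    """Parse HTML 5 markup and serialize the resulting lxml tree."""
    # Third-party imports live inside the function so this module can still
    # be imported where html5-parser and lxml are not installed.
    from html5_parser import parse   # assumed entry point of html5-parser
    from lxml.etree import tostring

    root = parse(markup)             # assumed to return the root lxml element
    return tostring(root)            # bytes containing the serialized tree
```

Calling `parse_to_bytes("<p>Hello, <b>world")` should return the serialized tree as bytes, with the elements the HTML 5 algorithm implies filled in.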
Installation
Unix
On a Unix-y system, with a working compiler, simply run:
pip install --no-binary lxml html5-parser
It is important that lxml is installed with the --no-binary flag. Without it, lxml uses a static copy of libxml2. For html5-parser to work, it must use the same libxml2 implementation as lxml, which is only possible if libxml2 is loaded dynamically.
You can set up html5-parser to run from a source checkout as follows:
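If the install needs to be scripted for a particular interpreter, the same command can be built in Python. This is a sketch only: the function names are hypothetical, and it assumes `python -m pip` is an acceptable stand-in for the bare `pip` command shown above.

```python
import subprocess
import sys


def pip_install_command(python=sys.executable):
    """Build the pip command line from the installation instructions."""
    return [
        python, "-m", "pip", "install",
        "--no-binary", "lxml",   # make pip build lxml from source, not a wheel
        "html5-parser",
    ]


def install(python=sys.executable):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(python))
```

Keeping the command in a list makes it easy to confirm that `--no-binary lxml` is present before anything is installed.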
git clone https://github.com/kovidgoyal/html5-parser && cd html5-parser
pip install --no-binary lxml 'lxml>=3.8.0' --user
python setup.py develop --user
Windows
On Windows, installation is a little more involved. There is a 200-line script that is used to install html5-parser and all its dependencies on the Windows continuous integration server. Using that script, installation can be done by running the following commands in a Visual Studio 2015 Command Prompt:
python.exe win-ci.py install_deps
python.exe win-ci.py test
This will install all dependencies and html5-parser in the sw sub-directory. You will need to add sw\bin to PATH and sw\python\Lib\site-packages to PYTHONPATH. Alternatively, copy the files into your system Python’s directories.
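The environment changes described above can also be applied to a single child process rather than system-wide. The sketch below assumes the variables are named exactly `PATH` and `PYTHONPATH` in the mapping passed in; the helper name `with_sw_paths` is hypothetical. It uses `ntpath` so the Windows path rules apply on any platform.

```python
import ntpath  # Windows path rules, usable on any platform


def with_sw_paths(env, sw_dir):
    """Return a copy of env with the sw directories prepended."""
    additions = {
        "PATH": ntpath.join(sw_dir, "bin"),
        "PYTHONPATH": ntpath.join(sw_dir, "python", "Lib", "site-packages"),
    }
    new_env = dict(env)
    for name, path in additions.items():
        existing = new_env.get(name)
        # Prepend so the freshly built files win over older copies.
        new_env[name] = path + ntpath.pathsep + existing if existing else path
    return new_env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` to start Python with html5-parser available.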
Benchmarking
There is a benchmark script named benchmark.py that compares the time taken to parse a large (~5.7 MB) HTML document with html5lib and with html5-parser. The results on my system show a speedup of 28x. The output from the script on my system is:
Testing with HTML file of 5,956,815 bytes
Parsing repeatedly with html5-parser
html5-parser took an average of : 0.491 seconds to parse it
Parsing repeatedly with html5lib
html5lib took an average of : 13.744 seconds to parse it
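The 28x figure follows directly from the two averages: 13.744 / 0.491 is about 28. A minimal timing harness in the same spirit is sketched below. It is not the project's benchmark.py, and the function names are hypothetical.

```python
import time


def average_time(func, arg, repeats=5):
    """Average wall-clock seconds taken by func(arg) over several runs."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        total += time.perf_counter() - start
    return total / repeats


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast parser is than the slow one."""
    return slow_seconds / fast_seconds


# Averages taken from the benchmark output above.
print(f"speedup: {speedup(13.744, 0.491):.1f}x")  # speedup: 28.0x
```

To time real parsers, pass each library's parse function and the document contents to `average_time`.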
There is further potential for speedup. Currently the gumbo subsystem uses its own cache for tag and attribute names, and the libxml2 subsystem uses a separate one. Unifying the two to use the libxml2 cache should yield significant performance and memory consumption gains.
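The memory side of that argument can be illustrated with a toy model. The `NameCache` class below is hypothetical and written in Python purely for illustration. It does not reflect how gumbo or libxml2 actually implement their caches. It only shows that two separate tables store each name twice, while a shared table stores it once.

```python
class NameCache:
    """Toy intern table: keeps one canonical copy of each name."""

    def __init__(self):
        self._names = {}

    def intern(self, name):
        # Return the stored copy, adding the name on first sight.
        return self._names.setdefault(name, name)

    def __len__(self):
        return len(self._names)


names = ["div", "span", "class", "href"]

# Separate caches: every name is stored once per subsystem.
parser_cache, tree_cache = NameCache(), NameCache()
for n in names:
    parser_cache.intern(n)
    tree_cache.intern(n)
separate_entries = len(parser_cache) + len(tree_cache)

# Unified cache: both subsystems share a single table.
shared = NameCache()
for n in names:
    shared.intern(n)  # parser side
    shared.intern(n)  # tree side
shared_entries = len(shared)

print(separate_entries, shared_entries)  # 8 4
```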
Source Distribution
File details
Details for the file html5-parser-0.2.1.tar.gz.
File metadata
- Download URL: html5-parser-0.2.1.tar.gz
- Upload date:
- Size: 238.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5f5a315391e3489f32aed6cec8acc2f5d361751dcfe502e4a700f1979154b859
MD5 | 768a2fd4b9f421cf2bcd5a729d9d1554
BLAKE2b-256 | b3c7c5c5e2de4000647295e6a79270ae93d599b4be021ed7a60f53fa1a1b5b54