adblockparser

Parser for Adblock Plus rules

These details have been verified by PyPI

Maintainers

dangra kmike lopuhin pablohoffman scrapinghub scrapy

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language

Project description

adblockparser is a package for working with Adblock Plus filter rules. It can parse Adblock Plus filters and match URLs against them.

Installation

pip install adblockparser

For faster filter matching (2x-10x faster on the default EasyList filters), install the pyre2 library. The version from GitHub is required:

pip install git+https://github.com/axiak/pyre2.git#egg=re2

Usage

To learn about Adblock Plus filter syntax, see the Adblock Plus filter documentation.

First, obtain filter rules: write them manually, read lines from a file downloaded from EasyList, etc.:

>>> raw_rules = [
...     "||ads.example.com^",
...     "@@||ads.example.com/notbanner^$~script",
... ]
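As a minimal sketch of the "read lines from a file" route, the helper below collects rule strings from any iterable of lines. The name load_rules and the in-memory sample list are illustrative assumptions, not part of adblockparser; the pre-filtering of blank lines, "!" comments and the "[Adblock ...]" header only keeps the list tidy, since the library may already skip such lines itself.

```python
import io


def load_rules(lines):
    """Return rule strings from an iterable of lines (hypothetical helper)."""
    rules = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, "!" comments and the "[Adblock Plus ...]" list header.
        if not line or line.startswith("!") or line.startswith("[Adblock"):
            continue
        rules.append(line)
    return rules


# In-memory stand-in for a downloaded filter list; a real file object
# opened with open(path, encoding="utf-8") can be passed the same way.
sample = io.StringIO(
    "[Adblock Plus 2.0]\n"
    "! Title: example list\n"
    "||ads.example.com^\n"
    "\n"
    "@@||ads.example.com/notbanner^$~script\n"
)
raw_rules = load_rules(sample)
```

The resulting list has the same shape as the hand-written raw_rules shown above.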

Create an AdblockRules instance from the rule strings:

>>> from adblockparser import AdblockRules
>>> rules = AdblockRules(raw_rules)

Use this instance to check whether a URL should be blocked:

>>> rules.should_block("http://ads.example.com")
True

Rules with options are ignored unless you pass a dict with option values:

>>> rules.should_block("http://ads.example.com/notbanner")
True
>>> rules.should_block("http://ads.example.com/notbanner", {'script': False})
False
>>> rules.should_block("http://ads.example.com/notbanner", {'script': True})
True

Consult the Adblock Plus docs for a description of the options. These options make it possible to write filters that depend on external information not available in the URL itself.

Performance

Regex engines

The AdblockRules class creates one huge regex to match filters that don’t use options. The pyre2 library handles such regexes better than the stdlib’s re module. If you have pyre2 installed, pass the use_re2 argument to make AdblockRules work faster:

>>> rules = AdblockRules(raw_rules, use_re2=True)  # doctest: +SKIP

With use_re2=True, matching sometimes fails and prints something like re2/dfa.cc:459: DFA out of memory: prog size 270515 mem 1713850 to stderr. To fix that, give the re2 library more memory via the max_mem argument:

>>> rules = AdblockRules(raw_rules, use_re2=True, max_mem=512*1024*1024)  # doctest: +SKIP

Make sure you are not using re2 0.2.20 installed from PyPI; it doesn’t work. Install it from the GitHub repo instead.
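To picture the "one huge regex" idea, the sketch below joins a few patterns into a single alternation with the stdlib re module, then checks whether a module named re2 is importable. The patterns, the names combine and matches_any, and the availability check are illustrative assumptions; they are not the regexes or the logic that adblockparser uses internally.

```python
import importlib.util
import re


def combine(patterns):
    """Join patterns into one alternation of non-capturing groups."""
    return re.compile("|".join("(?:%s)" % p for p in patterns))


# Hand-written stand-ins for filters already translated to regexes
# (assumed for illustration, not produced by the library).
patterns = [
    r"^https?://([^/]+\.)?ads\.example\.com([/:?]|$)",
    r"/banners/",
]
big_regex = combine(patterns)


def matches_any(url):
    # One search over the combined regex replaces a Python-level loop
    # over the individual patterns.
    return big_regex.search(url) is not None


# True if a module named re2 can be found on the import path. This does
# not tell a working build apart from the broken PyPI release.
use_re2 = importlib.util.find_spec("re2") is not None
```

The use_re2 flag computed this way could then be passed as AdblockRules(raw_rules, use_re2=use_re2), so the same script runs with or without pyre2 installed.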

Parsing rules with options

Rules that have options are currently matched in a loop, one by one. They are also checked for compatibility with the options passed by the user: for example, if the user didn’t pass the ‘script’ option (with a True or False value), all rules involving script are discarded.

This is slow if you have thousands of such rules. To make it faster, explicitly list all options you want to support in the AdblockRules constructor, disable skipping of unsupported rules, and always pass a dict with all options to the should_block method:

>>> rules = AdblockRules(
...    raw_rules,
...    supported_options=['script', 'domain'],
...    skip_unsupported_rules=False
... )
>>> params = {'script': False, 'domain': 'www.mystartpage.com'}
>>> rules.should_block("http://ads.example.com/notbanner", params)
False

This way, rules with unsupported options are filtered out once, when the AdblockRules instance is created.
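The compatibility check described in this section (a rule is discarded when it involves an option the caller did not pass) can be pictured with the simplified sketch below. The functions parse_options and rule_applies are hypothetical and handle only boolean options such as script and ~script; valued options such as domain=... are left out, and the library's real implementation may differ.

```python
def parse_options(rule_text):
    """Split 'pattern$opt1,~opt2' into {option_name: required_value}."""
    if "$" not in rule_text:
        return {}
    options_part = rule_text.rpartition("$")[2]
    parsed = {}
    for option in options_part.split(","):
        if option.startswith("~"):
            parsed[option[1:]] = False  # "~script" requires script to be False
        else:
            parsed[option] = True       # "script" requires script to be True
    return parsed


def rule_applies(rule_options, user_options):
    """Decide whether a rule with options should be considered at all."""
    # Discard the rule if it involves an option the caller did not pass.
    if any(name not in user_options for name in rule_options):
        return False
    # Otherwise every option value must agree with what the rule requires.
    return all(user_options[name] == wanted
               for name, wanted in rule_options.items())


exception_options = parse_options("@@||ads.example.com/notbanner^$~script")
```

Under this sketch the exception rule is skipped when no options are passed, applies for {'script': False}, and does not apply for {'script': True}, which lines up with the should_block results shown in the Usage section.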

Limitations

There are some known limitations of the current implementation:

element hiding rules are ignored;
matching URLs against a large number of filters can be slow-ish, especially if pyre2 is not installed and many filter options are enabled;
match-case filter option is not properly supported (it is ignored);
document filter option is not properly supported;
rules are not validated before parsing, so invalid rules may raise inconsistent exceptions or silently work incorrectly;
regular expressions in rules are not supported.

It is possible to remove all these limitations. Pull requests are welcome if you want to make it happen sooner!
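Until these limitations are lifted, it can help to know which rules in a list fall under them. The sketch below is a rough, hypothetical classifier (unsupported_reason is not part of adblockparser); it relies on common Adblock Plus syntax conventions, namely "##" and "#@#" for element hiding and a leading and trailing "/" for regular expressions, and it does not detect regex rules that are followed by options.

```python
def unsupported_reason(rule):
    """Return a short reason if the rule hits a listed limitation, else None."""
    rule = rule.strip()
    # Element hiding rules and their exceptions use "##" and "#@#".
    if "##" in rule or "#@#" in rule:
        return "element hiding"
    # Regular expression filters are written between slashes.
    if len(rule) > 1 and rule.startswith("/") and rule.endswith("/"):
        return "regular expression"
    # Options follow the last "$" and are separated by commas.
    options = rule.rpartition("$")[2].split(",") if "$" in rule else []
    if "match-case" in options:
        return "match-case option"
    if "document" in options:
        return "document option"
    return None


sample_rules = [
    "example.com##.ad-banner",
    "/banner\\d+/",
    "||example.com^$match-case",
    "@@||example.com^$document",
    "||ads.example.com^",
]
report = {rule: unsupported_reason(rule) for rule in sample_rules}
```

A report like this makes it easier to see how much of a given filter list is affected before relying on should_block results.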

Contributing

source code: https://github.com/scrapinghub/adblockparser
issue tracker: https://github.com/scrapinghub/adblockparser/issues

In order to run tests, install tox and type

tox

from the source checkout.

The license is MIT.

Release history

0.7

Oct 17, 2016

0.6

Sep 9, 2016

0.5

Mar 3, 2016

0.4

Mar 28, 2015

0.3

Jul 11, 2014

0.2

Feb 20, 2014

0.1.1

Feb 11, 2014

This version

0.1

Feb 7, 2014

Download files

Source Distribution

adblockparser-0.1.tar.gz (9.5 kB)

Uploaded Feb 7, 2014 Source

File details

Details for the file adblockparser-0.1.tar.gz.

File metadata

Download URL: adblockparser-0.1.tar.gz
Upload date: Feb 7, 2014
Size: 9.5 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for adblockparser-0.1.tar.gz
Algorithm	Hash digest
SHA256	`77f4ff1d6b81f26b85d3a41722cbd7dee4b88f1a0cc656049fa8c78ad6b649ed`
MD5	`c6779d1621337b824ad70beee2900f3f`
BLAKE2b-256	`836e28e81289be1f6a9a2df024d41a13be40f5cf2d60cd638797677150d5cf7d`