Scrapy Middleware that allows a Scrapy Spider to filter requests.

These details have not been verified by PyPI

Project links

Homepage

Development Status
- 4 - Beta
Environment
- Console
Framework
- Scrapy
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- Software Development

Project description

Scrapy-link-filter

Spider Middleware that allows a Scrapy Spider to filter requests. Similar functionality already exists in CrawlSpider (through Rules) and in RobotsTxtMiddleware. The difference is that this middleware allows rules to be defined dynamically per request, or as spider arguments, instead of in the project settings.

Install

This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.

$ pip install git+https://github.com/croqaz/scrapy-link-filter

Usage

For the middleware to be enabled as a Spider Middleware, it must be added in the project settings.py:

SPIDER_MIDDLEWARES = {
    # maybe other Spider Middlewares ...
    # can go after DepthMiddleware: 900
    'scrapy_link_filter.middleware.LinkFilterMiddleware': 950,
}

Or, it can be enabled as a Downloader Middleware, in the project settings.py:

DOWNLOADER_MIDDLEWARES = {
    # maybe other Downloader Middlewares ...
    # can go before RobotsTxtMiddleware: 100
    'scrapy_link_filter.middleware.LinkFilterMiddleware': 50,
}

The rules must be defined either in the spider instance, in a spider.extract_rules dict, or per request, in request.meta['extract_rules']. Internally, the extract_rules dict is converted into a LinkExtractor, which is used to match the requests.

Note that URL matching is case-sensitive by default, which works in most cases. To enable case-insensitive matching, add a "(?i)" inline flag at the beginning of each "allow" or "deny" rule that needs to be case-insensitive.
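The effect of the inline flag can be checked with Python's standard re module alone. This is only a small sketch of the regex behaviour, not the middleware's code; the URL is made up for illustration:

```python
import re

# Hypothetical URL whose path uses mixed case.
url = "https://example.com/EN/Items/42"

# Without the flag, the pattern is matched case-sensitively and fails.
case_sensitive = re.search(r"/en/items/", url)

# With "(?i)" at the very beginning, the whole pattern ignores case.
case_insensitive = re.search(r"(?i)/en/items/", url)

print(case_sensitive is None)        # True
print(case_insensitive is not None)  # True
```

The flag has to be the first thing in the pattern; recent Python versions reject a global inline flag placed anywhere else.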

Example of a specific allow filter, on a spider instance:

from scrapy.spiders import Spider

class MySpider(Spider):
    extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}

Or a specific deny filter, inside a request meta:

request.meta['extract_rules'] = {
    "deny_domains": ["whatever.com", "ignore.me"],
    "deny": ["/privacy-policy/?$", "/about-?(us)?$"]
}
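To see which URLs the two "deny" patterns above would reject, they can be tried with the standard re module. This is a sketch that assumes the patterns are applied as regular-expression searches against the full URL; the helper name is_denied and the sample URLs are hypothetical:

```python
import re

# The "deny" patterns from the example above.
DENY = ["/privacy-policy/?$", "/about-?(us)?$"]


def is_denied(url):
    """Return True if any deny pattern is found in the URL."""
    return any(re.search(pattern, url) for pattern in DENY)


samples = [
    "https://site.org/privacy-policy",           # denied
    "https://site.org/privacy-policy/",          # denied (optional trailing slash)
    "https://site.org/privacy-policy/details",   # kept: pattern is anchored with $
    "https://site.org/about",                    # denied
    "https://site.org/about-us",                 # denied
    "https://site.org/about/team",               # kept
]
for sample in samples:
    print(sample, is_denied(sample))
```

Because both patterns end in "$", only URLs that end at that path segment are rejected; deeper pages under the same prefix still pass.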

The possible fields are:

allow_domains and deny_domains - one or more domains to limit requests to, or to reject
allow and deny - one or more sub-strings or regular expression patterns that a URL must match to be allowed, or to be rejected

All fields can be defined as string, list, set, or tuple.
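As a rough illustration of how the four fields might combine, here is a standard-library sketch. It is not the middleware's implementation (which delegates to a Scrapy LinkExtractor); the helpers as_list and url_passes are hypothetical, and the subdomain handling is an assumption about how domain rules are usually interpreted:

```python
import re
from urllib.parse import urlparse


def as_list(value):
    """Accept None, a single string, or a list/set/tuple of strings."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def url_passes(url, rules):
    """Approximate check of a URL against an extract_rules-style dict."""
    host = (urlparse(url).hostname or "").lower()

    def from_domain(domain):
        # Assumption: a rule for "example.com" also covers its subdomains.
        domain = domain.lower()
        return host == domain or host.endswith("." + domain)

    allow_domains = as_list(rules.get("allow_domains"))
    deny_domains = as_list(rules.get("deny_domains"))
    allow = as_list(rules.get("allow"))
    deny = as_list(rules.get("deny"))

    if allow_domains and not any(from_domain(d) for d in allow_domains):
        return False
    if any(from_domain(d) for d in deny_domains):
        return False
    if allow and not any(re.search(p, url) for p in allow):
        return False
    if any(re.search(p, url) for p in deny):
        return False
    return True


rules = {"allow_domains": "example.com", "allow": "/en/items/"}
print(url_passes("https://example.com/en/items/42", rules))  # True
print(url_passes("https://example.com/fr/items/42", rules))  # False
```

In this sketch an empty "allow" or "allow_domains" field places no restriction, while any matching "deny" or "deny_domains" entry rejects the URL.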

License

BSD3 © Cristi Constantin.


Release history

This version

0.2.0

Dec 12, 2019

0.1.1

Dec 12, 2019

Download files

Source Distribution

scrapy-link-filter-0.2.0.tar.gz (5.4 kB)

Uploaded Dec 12, 2019 Source

Built Distribution

scrapy_link_filter-0.2.0-py3-none-any.whl (6.2 kB)

Uploaded Dec 12, 2019 Python 3

File details

Details for the file scrapy-link-filter-0.2.0.tar.gz.

File metadata

Download URL: scrapy-link-filter-0.2.0.tar.gz
Upload date: Dec 12, 2019
Size: 5.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9

File hashes

Hashes for scrapy-link-filter-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`64bf701cbbbc9f51dad47094c1effbf9ca47b6d6e9f54c0f64cedefcbbbc72e4`
MD5	`c46a8775512a159c7e5898e48c779229`
BLAKE2b-256	`3d1c175ecef969380c969abbd90f1063b8375fe68131f8a2c0c5995beb0b0a84`

File details

Details for the file scrapy_link_filter-0.2.0-py3-none-any.whl.

File metadata

Download URL: scrapy_link_filter-0.2.0-py3-none-any.whl
Upload date: Dec 12, 2019
Size: 6.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9

File hashes

Hashes for scrapy_link_filter-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f1f6c25569a765945a331daac3929731cd1f4e07ad129705ee53d0b6e65f9d36`
MD5	`d8319c5b016dd2c6f0a4f3e81e233b9a`
BLAKE2b-256	`a8d549b7e0fcd23809a513a59d9224812355d2b8424b672b3963e7d8155e4ba9`