Scrapy-link-filter
Spider Middleware that allows a Scrapy Spider to filter requests. There is similar functionality in the CrawlSpider already, using Rules, and in the RobotsTxtMiddleware, but there are twists. This middleware allows defining rules dynamically, per spider, per job, or per request.
Install
This project requires Python 3.6+ and pip. Using a virtual environment is strongly encouraged.
$ pip install git+https://github.com/croqaz/scrapy-link-filter
Usage
For the middleware to be enabled as a Spider Middleware, it must be added in the project settings.py:
SPIDER_MIDDLEWARES = {
# maybe other Spider Middlewares ...
# can go after DepthMiddleware: 900
'scrapy_link_filter.middleware.LinkFilterMiddleware': 950,
}
Or, it can be enabled as a Downloader Middleware, in the project settings.py:
DOWNLOADER_MIDDLEWARES = {
# maybe other Downloader Middlewares ...
# can go before RobotsTxtMiddleware: 100
'scrapy_link_filter.middleware.LinkFilterMiddleware': 50,
}
The rules must be defined either in the spider instance, in a spider.extract_rules dict, or per request, in request.meta['extract_rules'].
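For example, a spider can carry its rules as a class attribute; a minimal sketch (the spider name and URLs are placeholders for illustration):

import scrapy

class ExampleSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = "example"
    start_urls = ["https://example.com/en/items/"]

    # spider-level rules, picked up by the middleware for this spider's requests
    extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}

    def parse(self, response):
        ...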
Internally, the extract_rules dict is converted into a LinkExtractor, which is used to match the requests.
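As a rough sketch of that matching, assuming the rules dict is passed straight to Scrapy's LinkExtractor (the middleware internals may differ):

from scrapy.linkextractors import LinkExtractor

extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}
link_extractor = LinkExtractor(**extract_rules)

# matches() checks a single URL against the allow/deny rules
link_extractor.matches("https://example.com/en/items/42")     # True
link_extractor.matches("https://example.com/privacy-policy")  # False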
Example of a specific allow filter:
extract_rules = {"allow_domains": "example.com", "allow": "/en/items/"}
Or a specific deny filter:
extract_rules = {
"deny_domains": ["whatever.com", "ignore.me"],
"deny": ["/privacy-policy/?$", "/about-?(us)?$"]
}
The allowed fields are:
- allow_domains and deny_domains - one or more domains to specifically limit to, or specifically reject
- allow and deny - one or more sub-strings, or patterns, to specifically allow or reject
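Rules can also travel with a single request, via its meta, as mentioned above; a minimal sketch (the URL is a placeholder, the deny pattern is reused from the earlier example):

import scrapy

# per-request rules: only this request carries them
request = scrapy.Request(
    "https://example.com/en/items/",
    meta={"extract_rules": {"deny": ["/privacy-policy/?$"]}},
)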
License
BSD3 © Cristi Constantin.
File details
Details for the file scrapy-link-filter-0.1.1.tar.gz.
File metadata
- Download URL: scrapy-link-filter-0.1.1.tar.gz
- Upload date:
- Size: 4.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | eb562565d5ce4241d751f594998183516524b7a1f34d4dd4c3bf78ac090e3ddd
MD5 | 40db0993a46ff47482395c967187600a
BLAKE2b-256 | 036c6e45ddd2686db395babbde8daaa5518bd13d6ff7335c2b8657808b102cae
File details
Details for the file scrapy_link_filter-0.1.1-py3-none-any.whl.
File metadata
- Download URL: scrapy_link_filter-0.1.1-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.6.0 requests-toolbelt/0.9.1 tqdm/4.23.4 CPython/3.6.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 25093db548efc86823b6eb27db973c67b9877e0892713ceb676723d636be0b63
MD5 | a9d250b08095b6e707f8dad5fa301719
BLAKE2b-256 | 3eb534c59d25a104dd87c90fd3f22f53f1bbbba080a06f3b0a230370c4a7606a