Pure-Python robots.txt parser with support for modern conventions
Protego
Overview
Protego is a pure-Python robots.txt parser with support for modern conventions.
Requirements
- Python 2.7 or Python 3.5+
- Works on Linux, Windows, Mac OSX, BSD
Install
To install Protego, simply use pip:
pip install protego
Usage
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
Using Protego with Requests
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
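The session above fetches robots.txt from a hard-coded address. When starting from an arbitrary page URL, the robots.txt location can be derived from that URL's scheme and host. The following is a minimal sketch using only the standard library; robots_url is a hypothetical helper, not part of Protego, and it assumes Python 3 (urllib.parse).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration; not part of Protego.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

The returned URL can be passed to requests.get, and the response text to Protego.parse, as in the session above.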
Documentation
Class protego.Protego:
Properties
- sitemaps {list_iterator}: A list of sitemaps specified in robots.txt.
- preferred_host {string}: Preferred host specified in robots.txt.
Methods
- parse(robotstxt_body): Parse robots.txt and return a new instance of protego.Protego.
- can_fetch(url, user_agent): Return True if the user agent can fetch the URL, otherwise return False.
- crawl_delay(user_agent): Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.
- request_rate(user_agent): Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.
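Because crawl_delay and request_rate can each return None, a crawler usually has to combine them into a single wait time. One possible approach is sketched below: take the stricter of the two limits. effective_delay and its default parameter are hypothetical names, not part of Protego, and the sketch ignores the start_time and end_time fields of RequestRate.

```python
def effective_delay(crawl_delay, request_rate, default=0.0):
    """Return the number of seconds to wait between requests.

    crawl_delay  -- float or None, as returned by rp.crawl_delay(user_agent)
    request_rate -- object with .requests and .seconds, or None,
                    as returned by rp.request_rate(user_agent)
    default      -- hypothetical fallback used when neither value is set
    """
    candidates = [default]
    if crawl_delay is not None:
        candidates.append(crawl_delay)
    if request_rate is not None and request_rate.requests:
        # 10 requests per 60 seconds means one request every 6 seconds.
        candidates.append(request_rate.seconds / request_rate.requests)
    # The largest candidate is the most conservative wait.
    return max(candidates)
```

With the rp object from the Usage section, effective_delay(rp.crawl_delay("mybot"), rp.request_rate("mybot")) gives 6.0, because the request rate of 10/1m is stricter than the 4-second crawl delay.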