=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi-hypernode.com/pypi/protego
   :alt: Supported Python Versions

.. image:: https://img.shields.io/travis/scrapy/protego/master.svg
   :target: https://travis-ci.org/scrapy/protego
   :alt: Build Status

Protego is a pure-Python ``robots.txt`` parser with support for modern
conventions.

Install
=======

To install Protego, simply use pip:

.. code-block:: none

    pip install protego

Usage
=====

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests_:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

.. _Requests: https://3.python-requests.org/

Comparison
==========

The following table compares Protego to the most popular ``robots.txt`` parsers
implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values

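
To make the length-based precedence row concrete, the sketch below uses only the standard library's ``RobotFileParser``, which the table lists as lacking that feature. It assumes a CPython version whose ``urllib.robotparser`` evaluates rules in file order (first match wins); ``stdlib_can_fetch`` and the two rule sets are hypothetical names made up for this illustration.

```python
from urllib.robotparser import RobotFileParser


def stdlib_can_fetch(robots_body, url, user_agent="mybot"):
    """Evaluate a URL with the standard library's RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    # Note the argument order: user agent first, then URL.
    return parser.can_fetch(user_agent, url)


# Same two rules, differing only in the order they appear.
DISALLOW_FIRST = """
User-agent: *
Disallow: /
Allow: /about
"""

ALLOW_FIRST = """
User-agent: *
Allow: /about
Disallow: /
"""

url = "http://example.com/about"
disallow_first_result = stdlib_can_fetch(DISALLOW_FIRST, url)
allow_first_result = stdlib_can_fetch(ALLOW_FIRST, url)
print(disallow_first_result, allow_first_result)
```

Under first-match evaluation the answer for ``/about`` depends on rule order: the ``Disallow: /`` line shadows the later ``Allow: /about``. With length-based precedence the longer ``Allow: /about`` rule wins regardless of order, which is why the Usage example reports ``True`` for ``http://example.com/about`` even though ``Disallow: /`` comes first.
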
API Reference
=============

Class ``protego.Protego``:

Properties
----------

- ``sitemaps`` {``list_iterator``} A list of sitemaps specified in ``robots.txt``.

- ``preferred_host`` {string} Preferred host specified in ``robots.txt``.

Methods
-------

- ``parse(robotstxt_body)`` Parse ``robots.txt`` and return a new instance of ``protego.Protego``.

- ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch the URL, otherwise return ``False``.

- ``crawl_delay(user_agent)`` Return the crawl delay specified for the user agent as a float. If nothing is specified, return ``None``.

- ``request_rate(user_agent)`` Return the request rate specified for the user agent as a named tuple ``RequestRate(requests, seconds, start_time, end_time)``. If nothing is specified, return ``None``.
