=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi-hypernode.com/pypi/protego
   :alt: Supported Python Versions

.. image:: https://img.shields.io/travis/scrapy/protego/master.svg
   :target: https://travis-ci.org/scrapy/protego
   :alt: Build Status

Protego is a pure-Python robots.txt parser with support for modern conventions.

Install
=======

To install Protego, simply use pip:

.. code-block:: none

    pip install protego

Usage
=====

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
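The ``Request-rate: 10/1m`` line above is reported as ``requests=10, seconds=60``. As a rough illustration of how such a value can be read, here is a minimal sketch. It is not Protego's own parser: the function name ``parse_request_rate`` and the set of accepted unit suffixes (``s``, ``m``, ``h``) are assumptions made for this example, and it expects the value with any trailing comment already stripped.

```python
import re

# Assumed unit suffixes for this sketch; a missing suffix is read as seconds.
_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}


def parse_request_rate(value):
    """Return (requests, seconds) for a value such as "10/1m", or None.

    Hypothetical helper for illustration only, not part of Protego.
    """
    match = re.fullmatch(r"\s*(\d+)\s*/\s*(\d+)\s*([smh]?)\s*", value.lower())
    if match is None:
        return None  # the value does not look like "<requests>/<time>"
    requests, amount, unit = match.groups()
    return int(requests), int(amount) * _UNIT_SECONDS[unit]


print(parse_request_rate("10/1m"))
```

Under these assumptions ``10/1m`` reads as 10 requests per 60 seconds, which agrees with the ``RequestRate`` shown above.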

Using Protego with Requests_:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

.. _Requests: https://3.python-requests.org/

Comparison
==========

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
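To make the wildcard and length-based precedence rows concrete, the sketch below shows one way these two conventions can be implemented. It is a simplified illustration and not Protego's code: ``pattern_to_regex``, ``is_allowed`` and ``RULES`` are names invented for this example, the rules are taken from the robots.txt in the Usage section, and details such as percent-encoding normalization and user-agent group selection are deliberately left out.

```python
import re


def pattern_to_regex(pattern):
    """Compile a path pattern: '*' matches any run of characters and a
    trailing '$' anchors the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with a wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """Apply (directive, pattern) rules to a URL path.

    The longest matching pattern wins; on a tie, 'allow' wins.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow) of the best match so far
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# The Allow/Disallow lines from the Usage example (hypothetical structure).
RULES = [
    ("disallow", "/"),
    ("allow", "/about"),
    ("allow", "/account"),
    ("disallow", "/account/contact$"),
    ("disallow", "/account/*/profile"),
]

for path in ("/profiles", "/about", "/account/myuser/profile", "/account/contact"):
    print(path, is_allowed(path, RULES))
```

With these rules, ``/account/myuser/profile`` is refused because ``/account/*/profile`` is longer than the competing ``Allow: /account``, which matches the ``can_fetch`` results shown in the Usage section. A parser without length-based precedence may resolve such conflicts differently, for example by rule order.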

API Reference
=============

Class ``protego.Protego``:

Properties
----------

* ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs specified in ``robots.txt``.

* ``preferred_host`` {string} The preferred host specified in ``robots.txt``.

Methods
-------

* ``parse(robotstxt_body)`` Parse the content of a ``robots.txt`` file and return a new instance of ``protego.Protego``.

* ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch the URL, otherwise return ``False``.

* ``crawl_delay(user_agent)`` Return the crawl delay specified for the user agent as a float. If nothing is specified, return ``None``.

* ``request_rate(user_agent)`` Return the request rate specified for the user agent as a named tuple ``RequestRate(requests, seconds, start_time, end_time)``. If nothing is specified, return ``None``.
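A crawler usually needs a single wait time rather than two separate directives. The sketch below shows one possible way to combine ``crawl_delay`` and ``request_rate`` into a delay between requests. ``polite_delay`` and its ``default`` parameter are hypothetical names for this example and not part of Protego; the function only relies on the two methods documented above and ignores ``start_time`` and ``end_time``.

```python
def polite_delay(parser, user_agent, default=1.0):
    """Return the number of seconds to wait between two requests.

    `parser` is any object exposing crawl_delay(user_agent) and
    request_rate(user_agent) as described in the API reference.
    The stricter (larger) of the two delays is used; `default` is an
    assumed fallback for when robots.txt specifies neither.
    """
    delays = []

    crawl_delay = parser.crawl_delay(user_agent)
    if crawl_delay is not None:
        delays.append(float(crawl_delay))

    rate = parser.request_rate(user_agent)
    if rate is not None and rate.requests > 0:
        # e.g. 10 requests per 60 seconds -> one request every 6 seconds
        delays.append(rate.seconds / rate.requests)

    return max(delays) if delays else default
```

For the robots.txt in the Usage section (crawl delay of 4 seconds, 10 requests per 60 seconds), ``polite_delay(rp, "mybot")`` would pick the request-rate spacing of 6 seconds, since it is the stricter of the two. Taking the maximum is a design choice of this sketch; a crawler may prefer a different policy.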
