Skip to main content

Pure-Python robots.txt parser with support for modern conventions

Project description

Supported Python Versions Build Status

Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

pip install protego

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

Protego

RobotFileParser

Reppy

Robotexclusionrulesparser

Implementation language

Python

Python

C++

Python

Reference specification

Google

Martijn Koster’s 1996 draft

Wildcard support

Length-based precedence

Performance

+40%

+1300%

-25%

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} A list of sitemaps specified in robots.txt.

  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.

  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.

  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.

  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Protego-0.2.1.tar.gz (3.2 MB view details)

Uploaded Source

Built Distribution

Protego-0.2.1-py2.py3-none-any.whl (8.2 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file Protego-0.2.1.tar.gz.

File metadata

  • Download URL: Protego-0.2.1.tar.gz
  • Upload date:
  • Size: 3.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10

File hashes

Hashes for Protego-0.2.1.tar.gz
Algorithm Hash digest
SHA256 df666d4304dab774e2dc9feb208bb1ac8d71ea5ceec12f4c99eba30fbd642ff2
MD5 bfcefa2533a395040ac9129e8f69d8d3
BLAKE2b-256 3bcdde2f063d94115593bdff20d17f461522c90519fb3f5c46e6b4851e17b94e

See more details on using hashes here.

Provenance

File details

Details for the file Protego-0.2.1-py2.py3-none-any.whl.

File metadata

  • Download URL: Protego-0.2.1-py2.py3-none-any.whl
  • Upload date:
  • Size: 8.2 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.22.0 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.62.3 importlib-metadata/4.11.1 keyring/18.0.1 rfc3986/2.0.0 colorama/0.4.3 CPython/3.8.10

File hashes

Hashes for Protego-0.2.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 04419b18f20e8909f1691c6b678392988271cc2a324a72f9663cb3af838b4bf7
MD5 7b31da10f4c46481b45d41a7639f124a
BLAKE2b-256 814d3e01f10d6dd2d35793711c2e27a07e547c6aec0ab8d3199bb83e68956fdb

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page