Skip to main content

Pure-Python robots.txt parser with support for modern conventions

Project description

Supported Python Versions CI

Protego is a pure-Python robots.txt parser with support for modern conventions.

Install

To install Protego, simply use pip:

pip install protego

Usage

>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
...
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'

Using Protego with Requests:

>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']

Comparison

The following table compares Protego to the most popular robots.txt parsers implemented in Python or featuring Python bindings:

Protego

RobotFileParser

Reppy

Robotexclusionrulesparser

Implementation language

Python

Python

C++

Python

Reference specification

Google

Martijn Koster’s 1996 draft

Wildcard support

Length-based precedence

Performance

+40%

+1300%

-25%

API Reference

Class protego.Protego:

Properties

  • sitemaps {list_iterator} A list of sitemaps specified in robots.txt.

  • preferred_host {string} Preferred host specified in robots.txt.

Methods

  • parse(robotstxt_body) Parse robots.txt and return a new instance of protego.Protego.

  • can_fetch(url, user_agent) Return True if the user agent can fetch the URL, otherwise return False.

  • crawl_delay(user_agent) Return the crawl delay specified for the user agent as a float. If nothing is specified, return None.

  • request_rate(user_agent) Return the request rate specified for the user agent as a named tuple RequestRate(requests, seconds, start_time, end_time). If nothing is specified, return None.

  • visit_time(user_agent) Return the visit time specified for the user agent as a named tuple VisitTime(start_time, end_time). If nothing is specified, return None.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Protego-0.3.0.tar.gz (3.2 MB view details)

Uploaded Source

Built Distribution

Protego-0.3.0-py2.py3-none-any.whl (8.5 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file Protego-0.3.0.tar.gz.

File metadata

  • Download URL: Protego-0.3.0.tar.gz
  • Upload date:
  • Size: 3.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for Protego-0.3.0.tar.gz
Algorithm Hash digest
SHA256 04228bffde4c6bcba31cf6529ba2cfd6e1b70808fdc1d2cb4301be6b28d6c568
MD5 28b1ca092bda685de782a2b22d77d95a
BLAKE2b-256 2c7efc128cfd3bb8e081165fcdaad44ab5fff73678fbebc51f79f733c57c5295

See more details on using hashes here.

Provenance

File details

Details for the file Protego-0.3.0-py2.py3-none-any.whl.

File metadata

  • Download URL: Protego-0.3.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 8.5 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for Protego-0.3.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 db38f6a945839d8162a4034031a21490469566a2726afb51d668497c457fb0aa
MD5 31c63f418cc32df07cd604e6373a6721
BLAKE2b-256 bc1614fd1ecdece2e1d87279fc09fbd2d55bae5fa033783c3547af631c74d718

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page