Sitemap generation for Python, with support for crawling ASGI web apps directly.

sitemaps

Sitemaps is a Python command-line tool and library that generates sitemap files by crawling live web servers or ASGI apps. It is powered by HTTPX and anyio.

Note: This is alpha software, so the API may change between releases. Be sure to pin your dependency to the latest minor release (for example, sitemaps==0.1.*).

Quickstart

Live server

python -m sitemaps https://example.org

Example output:

$ cat sitemap.xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url><loc>https://example.org/</loc><changefreq>daily</changefreq></url>
</urlset>
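
To inspect a generated file, the example output above can be read back with the standard library. This is a minimal sketch, not part of Sitemaps itself; the helper name read_locations is hypothetical, and the namespace is the one shown in the <urlset> element above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace declared on the <urlset> element of the generated sitemap.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_locations(xml_text: str) -> List[str]:
    """Return the text of every <loc> element in a sitemap document.

    Hypothetical helper for illustration; not provided by the sitemaps package.
    """
    # Parse from bytes so the parser honours the declared encoding.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [(loc.text or "").strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]
```

Calling read_locations() on the contents of sitemap.xml gives a plain list of URLs, which is convenient for quick assertions in tests.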

ASGI app

HTTP requests are issued to the ASGI app directly. The target URL is only used as a base URL for building sitemap entries.

python -m sitemaps --asgi '<module>:<attribute>' http://testserver
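
For reference, an ASGI app is any async callable that accepts scope, receive and send. The following is a minimal sketch of such an app, assuming it is saved in a hypothetical module named myapp.py; the page content is made up for illustration.

```python
async def app(scope, receive, send):
    """Hypothetical single-page ASGI app, for illustration only."""
    if scope["type"] != "http":
        return  # This sketch handles HTTP requests only.

    # Every path returns the same page, which links to /about.
    body = b"<html><body><a href='/about'>About</a></body></html>"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/html; charset=utf-8")],
        }
    )
    await send({"type": "http.response.body", "body": body})
```

With that file on the import path, the '<module>:<attribute>' argument would be 'myapp:app'.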

Check mode

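Use check mode to verify that the existing sitemap is in sync with the site (e.g. as part of CI checks). The command fails if the computed XML differs from the existing output: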

python -m sitemaps --check [...]
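
The same kind of comparison can be done from Python when using the programmatic API. This is a rough sketch of the idea only, not the tool's actual implementation; is_in_sync is a hypothetical helper, and it assumes a byte-for-byte match of the XML text is the desired criterion.

```python
from pathlib import Path


def is_in_sync(output_path: str, computed_xml: str) -> bool:
    """Return True if the file at output_path already holds computed_xml.

    Hypothetical helper for illustration; not provided by the sitemaps package.
    """
    path = Path(output_path)
    if not path.exists():
        return False  # Nothing on disk yet, so the sitemap is out of date.
    return path.read_text(encoding="utf-8") == computed_xml
```

A CI script could exit with a non-zero status when is_in_sync() returns False.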

Features

  • Support for crawling any live web server.
  • Support for crawling an ASGI app directly (i.e. without having to spin up a server).
  • --check mode.
  • Invoke from the command line, or use the programmatic async API (supports asyncio and trio).
  • Fully type annotated.
  • 100% test coverage.

Installation

Install with pip:

$ pip install sitemaps

Sitemaps requires Python 3.7+.

Command line reference

$ python -m sitemaps --help
usage: __main__.py [-h] [-o OUTPUT] [-I IGNORE_PATH_PREFIX] [--asgi ASGI]
                   [--max-concurrency MAX_CONCURRENCY] [--check]
                   target

positional arguments:
  target                The base URL used to crawl the website and build
                        sitemap URL tags.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file path.
  -I IGNORE_PATH_PREFIX, --ignore-path-prefix IGNORE_PATH_PREFIX
                        Ignore URLs for this path prefix. Can be used multiple
                        times.
  --asgi ASGI           Path to an ASGI app, formatted as
                        '<module>:<attribute>'.
  --max-concurrency MAX_CONCURRENCY
                        Maximum number of URLs to process concurrently.
  --check               Compare existing output and fail if computed XML
                        differs.
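
As an illustration of the --ignore-path-prefix option, here is one way prefix filtering could work. The exact matching rules used by Sitemaps are not documented here, so this sketch assumes a plain string-prefix test on the URL path; is_ignored is a hypothetical name.

```python
from typing import Iterable
from urllib.parse import urlsplit


def is_ignored(url: str, ignored_prefixes: Iterable[str]) -> bool:
    """Return True if the URL's path starts with any of the ignored prefixes.

    Assumed behaviour for illustration; not the sitemaps implementation.
    """
    path = urlsplit(url).path
    return any(path.startswith(prefix) for prefix in ignored_prefixes)
```

Under this assumption, passing -I /admin -I /static on the command line would correspond to ignored_prefixes=["/admin", "/static"].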

Programmatic API

Live server

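import asyncio

import sitemaps

async def main():
    # Crawl the live site, using this URL as the base URL.
    urls = await sitemaps.crawl("https://example.org")
    # Render the crawled URLs as XML and write the sitemap file.
    with open("sitemap.xml", "w") as f:
        f.write(sitemaps.make_xml(urls))

asyncio.run(main())  # Or trio.run(main) when using trio.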

ASGI app

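import httpx
import sitemaps

from .app import app

async def main():
    # Route requests to the ASGI app in-process; no server needs to be running.
    # Newer HTTPX releases replace `app=app` with
    # `transport=httpx.ASGITransport(app=app)`.
    async with httpx.AsyncClient(app=app) as client:
        # "http://testserver" is only used as the base URL for sitemap entries.
        urls = await sitemaps.crawl("http://testserver", client=client)

    with open("sitemap.xml", "w") as f:
        f.write(sitemaps.make_xml(urls))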

Customizing URL tags

By default, sitemaps.make_xml() generates <url> tags with a daily change frequency. You can customize the generation of URL tags by passing a custom urltag callable, which takes a URL and returns the XML string for its <url> tag:

from urllib.parse import urlsplit

def urltag(url):
    path = urlsplit(url).path
    changefreq = "monthly" if path.startswith("/reports") else "daily"
    return f"<url><loc>{url}</loc><changefreq>{changefreq}</changefreq></url>"

async def main():
    urls = await sitemaps.crawl(...)
    with open("sitemap.xml", "w") as f:
        f.write(sitemaps.make_xml(urls, urltag=urltag))
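
The sitemap protocol requires values such as <loc> to be entity-escaped. Whether crawled URLs can contain characters like & is not stated here, so the following variant is a defensive sketch under that assumption; escaped_urltag is a hypothetical name, and it would be passed as urltag=escaped_urltag.

```python
from urllib.parse import urlsplit
from xml.sax.saxutils import escape


def escaped_urltag(url: str) -> str:
    """Build a <url> tag, escaping &, < and > in the URL for XML output."""
    path = urlsplit(url).path
    changefreq = "monthly" if path.startswith("/reports") else "daily"
    return f"<url><loc>{escape(url)}</loc><changefreq>{changefreq}</changefreq></url>"
```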

License

MIT

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

0.1.0 - 2020-05-31

Added

  • Initial implementation: CLI and programmatic async API.
