Sitemap generation for Python, with support for crawling ASGI web apps directly.
sitemaps
Sitemaps is a Python command line tool and library to generate sitemap files by crawling web servers or ASGI apps. Sitemaps is powered by HTTPX and anyio.
Note: This is alpha software. Be sure to pin your dependencies to the latest minor release.
Quickstart
Live server
python -m sitemaps https://example.org
Example output:
$ cat sitemap.xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url><loc>https://example.org/</loc><changefreq>daily</changefreq></url>
</urlset>
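To sanity-check a generated file, the <loc> entries can be read back with the Python standard library. The sketch below is illustrative and not part of Sitemaps; the helper name read_locs is hypothetical, and the sample document is a shortened copy of the output above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (matches the xmlns in the output above).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_locs(xml_text: str) -> list:
    """Return the text of every <loc> element in a sitemap document."""
    # Parse from bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]


example = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.org/</loc><changefreq>daily</changefreq></url>"
    "</urlset>"
)
print(read_locs(example))
```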
ASGI app
HTTP requests are issued to the ASGI app directly. The target URL is only used as a base URL for building sitemap entries.
python -m sitemaps --asgi '<module>:<attribute>' http://testserver
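The '<module>:<attribute>' format is the usual import-string convention for ASGI apps. How Sitemaps resolves it internally is not documented here; the following is a minimal sketch of one common way to do it, using a hypothetical helper named load_attribute and a standard-library module as a stand-in for a real app.

```python
import importlib


def load_attribute(import_string: str):
    """Resolve '<module>:<attribute>' to the object it names (hypothetical helper)."""
    module_name, sep, attr_path = import_string.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected '<module>:<attribute>', got {import_string!r}")
    obj = importlib.import_module(module_name)
    # Walk dotted attributes, so 'pkg.main:factory.app' also works.
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    return obj


# Demonstration with a standard-library module instead of a real ASGI app.
dumps = load_attribute("json:dumps")
print(dumps({"ok": True}))
```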
Check mode
Useful to verify that the sitemap is in sync (e.g. as part of CI checks):
python -m sitemaps --check [...]
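Conceptually, check mode compares the existing output file with freshly computed XML and fails when they differ. The sketch below illustrates that idea only; the function name is_in_sync is hypothetical, and treating a missing file as out of sync and comparing text exactly are assumptions, not documented behaviour of the tool.

```python
import tempfile
from pathlib import Path


def is_in_sync(path: str, computed_xml: str) -> bool:
    """Return True if the file at `path` exists and matches `computed_xml` exactly."""
    target = Path(path)
    if not target.exists():
        # Assumption: a missing sitemap counts as out of sync.
        return False
    return target.read_text(encoding="utf-8") == computed_xml


with tempfile.TemporaryDirectory() as tmp:
    sitemap = Path(tmp) / "sitemap.xml"
    sitemap.write_text("<urlset></urlset>", encoding="utf-8")
    print(is_in_sync(str(sitemap), "<urlset></urlset>"))
    print(is_in_sync(str(sitemap), "<urlset><url/></urlset>"))
```

In a CI job, a False result would translate into a non-zero exit status so that the build fails until the sitemap is regenerated.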
Features
- Support for crawling any live web server.
- Support for crawling an ASGI app directly (i.e. without having to spin up a server).
- --check mode.
- Invoke from the command line, or use the programmatic async API (supports asyncio and trio).
- Fully type annotated.
- 100% test coverage.
Installation
Install with pip:
$ pip install sitemaps
Sitemaps requires Python 3.7+.
Command line reference
$ python -m sitemaps --help
usage: __main__.py [-h] [-o OUTPUT] [-I IGNORE_PATH_PREFIX] [--asgi ASGI]
                   [--max-concurrency MAX_CONCURRENCY] [--check]
                   target

positional arguments:
  target                The base URL used to crawl the website and build
                        sitemap URL tags.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file path.
  -I IGNORE_PATH_PREFIX, --ignore-path-prefix IGNORE_PATH_PREFIX
                        Ignore URLs for this path prefix. Can be used multiple
                        times.
  --asgi ASGI           Path to an ASGI app, formatted as
                        '<module>:<attribute>'.
  --max-concurrency MAX_CONCURRENCY
                        Maximum number of URLs to process concurrently.
  --check               Compare existing output and fail if computed XML
                        differs.
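The -I / --ignore-path-prefix option can be given several times, once per prefix. The sketch below shows one plausible reading of "ignore URLs for this path prefix": a plain string-prefix test on the URL path. The helper name is_ignored is hypothetical, and the exact matching rules used by Sitemaps may differ.

```python
from urllib.parse import urlsplit


def is_ignored(url: str, ignore_path_prefixes: list) -> bool:
    """Return True if the URL's path starts with any of the given prefixes."""
    # Only the path is compared; scheme, host and query string are left out.
    path = urlsplit(url).path
    return any(path.startswith(prefix) for prefix in ignore_path_prefixes)


prefixes = ["/admin", "/static"]
print(is_ignored("https://example.org/admin/users", prefixes))
print(is_ignored("https://example.org/blog/", prefixes))
```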
Programmatic API
Live server
import sitemaps

async def main():
    urls = await sitemaps.crawl("https://example.org")

    with open("sitemap.xml", "w") as f:
        f.write(sitemaps.make_xml(urls))
ASGI app
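import httpx
import sitemaps

from .app import app

async def main():
    # Requests go to the ASGI app directly; no server is started.
    async with httpx.AsyncClient(app=app) as client:
        # The URL is only used as a base URL for building sitemap entries.
        urls = await sitemaps.crawl("http://testserver", client=client)

    with open("sitemap.xml", "w") as f:
        f.write(sitemaps.make_xml(urls))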
Customizing URL tags
By default, .make_xml() generates <url> tags with a daily change frequency. You can customize the generation of URL tags by passing a custom urltag callable:
from urllib.parse import urlsplit

import sitemaps

def urltag(url):
    # Pages under /reports change less often than the rest of the site.
    path = urlsplit(url).path
    changefreq = "monthly" if path.startswith("/reports") else "daily"
    return f"<url><loc>{url}</loc><changefreq>{changefreq}</changefreq></url>"

async def main():
    urls = await sitemaps.crawl(...)

    with open("sitemap.xml", "w") as f:
        f.write(sitemaps.make_xml(urls, urltag=urltag))
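Because a custom urltag callable builds XML by hand, characters such as & in a URL need to be entity-escaped to keep the file well-formed, as the sitemap protocol requires. The variant below is a sketch under the assumption that the URLs passed to urltag are not already escaped; if they are, escaping again would corrupt them. It relies only on the standard library.

```python
from urllib.parse import urlsplit
from xml.sax.saxutils import escape


def urltag(url: str) -> str:
    path = urlsplit(url).path
    changefreq = "monthly" if path.startswith("/reports") else "daily"
    # escape() replaces &, < and > with their XML entities.
    return f"<url><loc>{escape(url)}</loc><changefreq>{changefreq}</changefreq></url>"


print(urltag("https://example.org/reports?year=2020&page=2"))
```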
License
MIT
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
0.1.0 - 2020-05-31
Added
- Initial implementation: CLI and programmatic async API.