A modern Python library for writing maintainable web scrapers.
Project description
Overview
spatula is a modern Python library for writing maintainable web scrapers.
Source: https://github.com/jamesturk/spatula
Documentation: https://jamesturk.github.io/spatula/
Issues: https://github.com/jamesturk/spatula/issues
Features
- Page-oriented design: Encourages writing understandable & maintainable scrapers.
- Not Just HTML: Provides built-in handlers for common data formats including CSV, JSON, XML, PDF, and Excel. Or write your own.
- Fast HTML parsing: Uses lxml.html for fast, consistent, and reliable parsing of HTML.
- Flexible Data Model Support: Compatible with dataclasses, attrs, pydantic, or bring your own data model classes for storing & validating your scraped data.
- CLI Tools: Offers several CLI utilities that can help streamline the development & testing cycle.
- Fully Typed: Makes full use of Python 3 type annotations.
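As a rough illustration of the "bring your own data model" point, a scraped record could be held in a standard-library dataclass that validates itself on construction. This is a minimal sketch in plain Python, not spatula's API: the `Employee` class, its fields, and the `from_row` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Employee:
    """Hypothetical record type for one scraped row (not part of spatula)."""

    first: str
    last: str
    position: str

    def __post_init__(self) -> None:
        # Reject rows where a required cell is empty or only whitespace.
        for name, value in asdict(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"missing value for {name!r}")


def from_row(cells: list[str]) -> Employee:
    """Build an Employee from three already-extracted cell strings."""
    first, last, position = (cell.strip() for cell in cells)
    return Employee(first=first, last=last, position=position)
```

Validating at construction time means a malformed row fails where it was scraped, which tends to be easier to debug than a bad value discovered later in a pipeline.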
Installation
spatula is on PyPI, and can be installed via any standard package management tool:
poetry add spatula
or:
pip install spatula
Example
An example of a fairly simple two-page scrape. Read A First Scraper for a walkthrough of how it was built.
from spatula import HtmlPage, HtmlListPage, CSS, XPath, SelectorError


class EmployeeList(HtmlListPage):
    # by providing this here, it can be omitted on the command line
    # useful in cases where the scraper is only meant for one page
    source = "https://yoyodyne-propulsion.herokuapp.com/staff"

    # each row represents an employee
    selector = CSS("#employees tbody tr")

    def process_item(self, item):
        # this function is called for each <tr> we get from the selector
        # we know there are 4 <tds>
        first, last, position, details = item.getchildren()
        return EmployeeDetail(
            dict(
                first=first.text,
                last=last.text,
                position=position.text,
            ),
            source=XPath("./a/@href").match_one(details),
        )

    def get_next_source(self):
        try:
            return XPath("//a[contains(text(), 'Next')]/@href").match_one(self.root)
        except SelectorError:
            pass


class EmployeeDetail(HtmlPage):
    def process_page(self):
        marital_status = CSS("#status").match_one(self.root)
        children = CSS("#children").match_one(self.root)
        hired = CSS("#hired").match_one(self.root)
        return dict(
            marital_status=marital_status.text,
            children=children.text,
            hired=hired.text,
            # self.input is the data passed in from the prior scrape
            **self.input,
        )

    def process_error_response(self, exc):
        self.logger.warning(exc)
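The control flow in this example (walk a paginated list, follow each row to a detail page, merge both sets of fields) can be sketched without any network access. The following is a plain-Python illustration under stated assumptions, not spatula's implementation: `FAKE_LIST_PAGES`, `FAKE_DETAILS`, and `scrape_all` are hypothetical stand-ins for the fetched pages and for whatever driver runs the page classes, and the records are made-up data.

```python
# Hypothetical in-memory stand-ins for fetched pages (made-up data).
FAKE_LIST_PAGES = {
    "/staff": {
        "rows": [
            {"first": "A", "last": "One", "position": "Engineer", "href": "/staff/1"},
        ],
        "next": "/staff?page=2",
    },
    "/staff?page=2": {
        "rows": [
            {"first": "B", "last": "Two", "position": "Manager", "href": "/staff/2"},
        ],
        "next": None,  # plays the role of get_next_source() returning None
    },
}
FAKE_DETAILS = {
    "/staff/1": {"marital_status": "Single", "children": "0", "hired": "2001-01-01"},
    "/staff/2": {"marital_status": "Married", "children": "2", "hired": "2002-02-02"},
}


def scrape_all(start):
    """Yield one merged record per employee, following 'next' links."""
    source = start
    while source is not None:
        page = FAKE_LIST_PAGES[source]
        for row in page["rows"]:
            # Like process_item: keep the list-page fields, follow the detail link.
            list_fields = {k: v for k, v in row.items() if k != "href"}
            detail_fields = FAKE_DETAILS[row["href"]]
            # Like process_page: detail fields plus the data passed in.
            yield dict(**detail_fields, **list_fields)
        source = page["next"]


records = list(scrape_all("/staff"))
```

One detail worth knowing about the merge: `dict(..., **mapping)` raises `TypeError` if the same key is supplied twice, so the list-page and detail-page field names need to be distinct.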
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
spatula-0.8.1.tar.gz (14.7 kB)
Built Distribution
spatula-0.8.1-py3-none-any.whl (15.0 kB)
File details
Details for the file spatula-0.8.1.tar.gz.
File metadata
- Download URL: spatula-0.8.1.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.6 CPython/3.9.5 Darwin/20.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | d74f07c426ae8fbe080934478360db933e91b118301efb7e6610ef59dcebe353
MD5 | e7306e14ced32d3309d51ddfee49eabd
BLAKE2b-256 | e5bf6e885d47c355b69c50ea26cd564105502d7bd111d3e1e6e2213a7ff06161
File details
Details for the file spatula-0.8.1-py3-none-any.whl.
File metadata
- Download URL: spatula-0.8.1-py3-none-any.whl
- Upload date:
- Size: 15.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.1.6 CPython/3.9.5 Darwin/20.3.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | c46562ae0ff567b0b13d823a2d3b8fa72da59a5338cdabccc13c5266518b730c
MD5 | 525b8efc066e4e11f0290f8a7c0e7cbe
BLAKE2b-256 | 78b8e660fe91a890e152f0f5eaad15c24d91a9fe8996a5d8148dfe2740ee1faa
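The SHA256 digests listed for the two 0.8.1 files can be checked against a downloaded copy with the standard library. This is a minimal sketch, assuming the file has been saved locally under its original name; `sha256_of` and `matches_listed_digest` are hypothetical helper names for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digests as listed for the 0.8.1 files.
EXPECTED_SHA256 = {
    "spatula-0.8.1.tar.gz": (
        "d74f07c426ae8fbe080934478360db933e91b118301efb7e6610ef59dcebe353"
    ),
    "spatula-0.8.1-py3-none-any.whl": (
        "c46562ae0ff567b0b13d823a2d3b8fa72da59a5338cdabccc13c5266518b730c"
    ),
}


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path):
    """True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

For example, `matches_listed_digest("spatula-0.8.1.tar.gz")` would return `True` only for an unmodified copy of that file in the current directory.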