Scrapinghub's Page Object pattern for web scraping
Project description
web-poet implements the Page Object pattern for web scraping.
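As a rough illustration of the idea (not web-poet's own implementation), a page object wraps the content of one page and exposes the data it contains through named attributes, so the extraction logic lives in one class instead of being scattered through the crawling code. The sketch below uses only the standard library; `TitlePage` and `_TitleParser` are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded; later ones are ignored.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class TitlePage:
    """Hypothetical page object: holds one page's HTML and knows how to read it."""

    def __init__(self, url, html):
        self.url = url
        self.html = html

    @property
    def title(self):
        parser = _TitleParser()
        parser.feed(self.html)
        return parser.title.strip() if parser.title is not None else None

    def to_item(self):
        return {"url": self.url, "title": self.title}


page = TitlePage(
    "http://example.com",
    "<html><head><title> Example </title></head></html>",
)
print(page.to_item())
```

Because the page object takes the HTML as a plain constructor argument, it can be exercised with a canned string like the one above, without any network access.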
License is BSD 3-clause.
Installation
pip install web-poet
Usage
The following script uses urllib.request to fetch books.toscrape.com and a page object to extract the book links from the response.
import urllib.request

from web_poet.pages import ItemWebPage
from web_poet.page_inputs import ResponseData


class BookLinksPage(ItemWebPage):

    @property
    def links(self):
        return self.css('.image_container a::attr(href)').getall()

    def to_item(self) -> dict:
        return {
            'links': self.links,
        }


response = urllib.request.urlopen('http://books.toscrape.com')
response_data = ResponseData(response.url, response.read().decode('utf-8'))
page = BookLinksPage(response_data)

print(page.to_item())
Output should be similar to this:
{
    'links': [
        'catalogue/a-light-in-the-attic_1000/index.html',
        'catalogue/tipping-the-velvet_999/index.html',
        'catalogue/soumission_998/index.html',
        'catalogue/sharp-objects_997/index.html',
        'catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
        'catalogue/the-requiem-red_995/index.html',
        'catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html',
        'catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html',
        'catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html',
        'catalogue/the-black-maria_991/index.html',
        'catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html',
        'catalogue/shakespeares-sonnets_989/index.html',
        'catalogue/set-me-free_988/index.html',
        'catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html',
        'catalogue/rip-it-up-and-start-again_986/index.html',
        'catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html',
        'catalogue/olio_984/index.html',
        'catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html',
        'catalogue/libertarianism-for-beginners_982/index.html',
        'catalogue/its-only-the-himalayas_981/index.html',
    ]
}
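The links in this output are relative to the page they were scraped from. If absolute URLs are needed, one option is to resolve them against the response URL with the standard library's urllib.parse.urljoin. This is a minimal sketch of a possible follow-up step, not something the script above does; `absolute_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def absolute_links(base_url, links):
    """Resolve each possibly-relative link against base_url."""
    return [urljoin(base_url, link) for link in links]


# Two of the relative links from the sample output above.
links = [
    'catalogue/a-light-in-the-attic_1000/index.html',
    'catalogue/tipping-the-velvet_999/index.html',
]

for url in absolute_links('http://books.toscrape.com', links):
    print(url)
```

With the page object from the script, the same helper could be called as absolute_links(response.url, page.links).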