Experimental library for leveraging GPT for web scraping.
Project description
scrapeghost
An experiment in using GPT-4 to scrape websites.
Usage
You will need an OpenAI API key with access to the GPT-4 API. Configure the key (and organization, if you use one) as you otherwise would via the openai library:
import os
import openai

openai.organization = os.getenv("OPENAI_API_ORG")
openai.api_key = os.getenv("OPENAI_API_KEY")
Then, use SchemaScraper to create scrapers for pages by defining the shape of the data you want to extract:
>>> from scrapeghost import SchemaScraper
>>> scrape_legislators = SchemaScraper(
        schema={
            "name": "string",
            "url": "url",
            "district": "string",
            "party": "string",
            "photo_url": "url",
            "offices": [{"name": "string", "address": "string", "phone": "string"}],
        }
    )
>>> scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
{'name': 'Emanuel "Chris" Welch',
'url': 'https://www.ilga.gov/house/Rep.asp?MemberID=3071',
'district': '7th', 'party': 'D',
'photo_url': 'https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg',
'offices': [
{'name': 'Springfield Office', 'address': '300 Capitol Building, Springfield, IL 62706', 'phone': '(217) 782-5350'},
{'name': 'District Office', 'address': '10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154', 'phone': '(708) 450-1000'}
]}
That's it.
You can also provide a hint to the scraper to help it find the right data. This is useful for managing the total number of tokens sent, since the CSS/XPath selector is executed before the data is sent to the API:
>>> scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071", xpath="//table[1]")
See the blog post for more: https://jamesturk.net/posts/scraping-with-gpt-4/
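Putting the snippets above together, a minimal end-to-end script might look like the following. This is a sketch based only on the examples in this README and assumes the SchemaScraper interface shown above (scrapeghost 0.1.0); it is not an additional documented example from the package itself.

# Assumes `pip install scrapeghost` and an OpenAI API key with GPT-4 access.
import os

import openai

from scrapeghost import SchemaScraper

# Configure the OpenAI client from environment variables.
openai.organization = os.getenv("OPENAI_API_ORG")
openai.api_key = os.getenv("OPENAI_API_KEY")

# Define the shape of the data to extract.
scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)

# Pass an XPath hint so only the matching part of the page is sent to the API.
data = scrape_legislators(
    "https://www.ilga.gov/house/rep.asp?MemberID=3071",
    xpath="//table[1]",
)
print(data["name"], data["district"], data["party"])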
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file scrapeghost-0.1.0.tar.gz.
File metadata
- Download URL: scrapeghost-0.1.0.tar.gz
- Upload date:
- Size: 8.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.0 CPython/3.10.9 Darwin/21.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 6eb9e81c9284e4245c04875a41eed9d6642bc13f20ed255c1ea4c8dd9ed6b431
MD5 | 925941affce655abb6da220c9dd1ced0
BLAKE2b-256 | bee3324b41778eec96664462583682c65b86ae22a76d5cafe8f20d71ed4855a0
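If you want to verify a downloaded copy of the sdist against the SHA256 digest listed above, a standard hashlib check like the one below works; the local filename is an assumption about where the file was saved.

import hashlib

# Expected digest copied from the table above.
EXPECTED_SHA256 = "6eb9e81c9284e4245c04875a41eed9d6642bc13f20ed255c1ea4c8dd9ed6b431"

with open("scrapeghost-0.1.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# A mismatch means the download is corrupted or is not the file published here.
assert digest == EXPECTED_SHA256, f"unexpected digest: {digest}"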
File details
Details for the file scrapeghost-0.1.0-py3-none-any.whl.
File metadata
- Download URL: scrapeghost-0.1.0-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.4.0 CPython/3.10.9 Darwin/21.6.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 56f09eb61a7e7c22ba1c8ed76a76ad3ffee0a38a1c23f5a7f72ca1bf0c8e73e9
MD5 | e181e733fc674299e8dfca01f0f8ea6f
BLAKE2b-256 | 87e3e0da3487cf6877b6a4262817a2be3fe6ac3b74de1e39182a32d9b1c11a57