Extract embedded metadata from HTML markup
Project description
extruct is a library for extracting embedded metadata from HTML markup.
It also has a built-in HTTP server to test its output as JSON.
Currently, extruct supports:
The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.
Roadmap
support for Complex Object Properties within Open Graph protocol)
Installation
pip install extruct
Usage
All-in-one extraction
The simplest example how to use extruct is to call extruct.extract(htmlstring, url) with some HTML string and a URL.
Let’s try this on a page on eBay which uses microdata and RDFa (with ogp).
First fetch the HTML using python-requests and then feed the response body to extruct:
>>> import requests >>> from pprint import pprint >>> r = requests.get('http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487') >>> import extruct >>> data = extruct.extract(r.text, r.url) >>> pprint(data) {'json-ld': [], 'microdata': [{'properties': {'image': ['http://i.ebayimg.com/images/g/0M4AAOSwT-FZBeOQ/s-l300.jpg', 'http://i.ebayimg.com/images/g/0M4AAOSwT-FZBeOQ/s-l300.jpg'], 'name': 'Details about \xa0HERBERT TERRY 2 ' 'STEP ANGLEPOISE LAMP MODEL1227', 'offers': {'properties': {'areaServed': 'United ' 'Kingdom ' 'and ' 'many ' 'other ' 'countries \n' '\t\t\t\t\t\t' '| See ' 'details', 'availability': 'http://schema.org/InStock', 'availableAtOrFrom': 'Stockport, ' 'United ' 'Kingdom', 'itemCondition': '--not ' 'specified', 'price': '150.0', 'priceCurrency': 'GBP'}, 'type': 'http://schema.org/Offer'}}, 'type': 'http://schema.org/Product'}, {'properties': {'itemListElement': [{'properties': {'item': 'http://www.ebay.com/sch/Antiques-/20081/i.html', 'name': 'Antiques', 'position': '1'}, 'type': 'http://schema.org/ListItem'}, (...) {'properties': {'item': 'http://www.ebay.com/sch/20th-Century-/66861/i.html', 'name': '20th ' 'Century', 'position': '4'}, 'type': 'http://schema.org/ListItem'}]}, 'type': 'http://schema.org/BreadcrumbList'}], 'rdfa': [{'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487#w1-31-_topHelpTxt', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]}, (...) {'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487', 'http://opengraphprotocol.org/schema/description': [{'@value': 'On ' 'one ' 'side ' 'of ' 'the ' 'base ' 'is ' 'a ' 'metal ' 'label ' 'from ' 'UMIST, ' 'where ' 'it ' 'was ' 'in ' 'use. ' '| ' 'eBay!'}], 'http://opengraphprotocol.org/schema/image': [{'@value': 'http://i.ebayimg.com/images/i/282478964487-0-1/s-l1000.jpg'}], 'http://opengraphprotocol.org/schema/site_name': [{'@value': 'eBay'}], 'http://opengraphprotocol.org/schema/title': [{'@value': 'HERBERT ' 'TERRY 2 ' 'STEP ' 'ANGLEPOISE ' 'LAMP ' 'MODEL1227 ' '| eBay'}], 'http://opengraphprotocol.org/schema/type': [{'@value': 'ebay-objects:item'}], 'http://opengraphprotocol.org/schema/url': [{'@value': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487'}], 'http://www.facebook.com/2008/fbmlapp_id': [{'@value': '102628213125203'}]}, {'@id': '_:Na28391785e4e48bb92849fccbe758c6b', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#button'}]}, (...) {'@id': 'http://www.ebay.com/itm/HERBERT-TERRY-2-STEP-ANGLEPOISE-LAMP-MODEL1227-/282478964487#glbfooter', 'http://www.w3.org/1999/xhtml/vocab#role': [{'@id': 'http://www.w3.org/1999/xhtml/vocab#contentinfo'}]}]}
Another example with a page from SongKick containing RDFa and JSON-LD metadata:
>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields') >>> data = extruct.extract(r.text, r.url) >>> pprint(data) {'json-ld': [{'@context': 'http://schema.org', '@type': 'MusicEvent', 'location': {'@type': 'Place', 'address': {'@type': 'PostalAddress', 'addressCountry': 'US', 'addressLocality': 'Brooklyn', 'addressRegion': 'NY', 'postalCode': '11225', 'streetAddress': '497 Rogers Ave'}, 'geo': {'@type': 'GeoCoordinates', 'latitude': 40.660109, 'longitude': -73.953193}, 'name': 'The Owl Music Parlor', 'sameAs': 'http://www.theowl.nyc'}, 'name': 'Elysian Fields', 'performer': [{'@type': 'MusicGroup', 'name': 'Elysian Fields', 'sameAs': 'http://www.songkick.com/artists/236156-elysian-fields?utm_medium=organic&utm_source=microformat'}], 'startDate': '2017-06-10T19:30:00-0400', 'url': 'http://www.songkick.com/concerts/30173984-elysian-fields-at-owl-music-parlor?utm_medium=organic&utm_source=microformat'}, (...) {'@context': 'http://schema.org', '@type': 'MusicGroup', 'image': 'https://images.sk-static.com/images/media/profile_images/artists/236156/card_avatar', 'interactionCount': '5557 UserLikes', 'logo': 'https://images.sk-static.com/images/media/profile_images/artists/236156/card_avatar', 'name': 'Elysian Fields', 'url': 'http://www.songkick.com/artists/236156-elysian-fields?utm_medium=organic&utm_source=microformat'}], 'microdata': [], 'rdfa': [{'@id': 'http://www.songkick.com/artists/236156-elysian-fields', 'al:ios:app_name': [{'@value': 'Songkick Concerts'}], 'al:ios:app_store_id': [{'@value': '438690886'}], 'al:ios:url': [{'@value': 'songkick://artists/236156-elysian-fields'}], 'http://ogp.me/ns#description': [{'@value': 'Buy tickets for an ' 'upcoming Elysian ' 'Fields concert near ' 'you. List of all ' 'Elysian Fields tickets ' 'and tour dates for ' '2017.'}], 'http://ogp.me/ns#image': [{'@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}], 'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}], 'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}], 'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}], 'http://ogp.me/ns#url': [{'@value': 'http://www.songkick.com/artists/236156-elysian-fields'}], 'http://www.facebook.com/2008/fbmlapp_id': [{'@value': '308540029359'}]}]}
You can also use each extractor individually. See below.
Microdata extraction
>>> from pprint import pprint >>> >>> from extruct.w3cmicrodata import MicrodataExtractor >>> >>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items >>> html = """<!DOCTYPE HTML> ... <html> ... <head> ... <title>Photo gallery</title> ... </head> ... <body> ... <h1>My photos</h1> ... <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses"> ... <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest."> ... <figcaption itemprop="title">The house I found.</figcaption> ... </figure> ... <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses"> ... <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside."> ... <figcaption itemprop="title">The mailbox.</figcaption> ... </figure> ... <footer> ... <p id="licenses">All images licensed under the <a itemprop="license" ... href="http://www.opensource.org/licenses/mit-license.php">MIT ... license</a>.</p> ... </footer> ... </body> ... </html>""" >>> >>> mde = MicrodataExtractor() >>> data = mde.extract(html) >>> pprint(data) [{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php', 'title': 'The house I found.', 'work': 'http://www.example.com/images/house.jpeg'}, 'type': 'http://n.whatwg.org/work'}, {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php', 'title': 'The mailbox.', 'work': 'http://www.example.com/images/mailbox.jpeg'}, 'type': 'http://n.whatwg.org/work'}]
JSON-LD extraction
>>> from pprint import pprint >>> >>> from extruct.jsonld import JsonLdExtractor >>> >>> html = """<!DOCTYPE HTML> ... <html> ... <head> ... <title>Some Person Page</title> ... </head> ... <body> ... <h1>This guys</h1> ... <script type="application/ld+json"> ... { ... "@context": "http://schema.org", ... "@type": "Person", ... "name": "John Doe", ... "jobTitle": "Graduate research assistant", ... "affiliation": "University of Dreams", ... "additionalName": "Johnny", ... "url": "http://www.example.com", ... "address": { ... "@type": "PostalAddress", ... "streetAddress": "1234 Peach Drive", ... "addressLocality": "Wonderland", ... "addressRegion": "Georgia" ... } ... } ... </script> ... </body> ... </html>""" >>> >>> jslde = JsonLdExtractor() >>> >>> data = jslde.extract(html) >>> pprint(data) [{'@context': 'http://schema.org', '@type': 'Person', 'additionalName': 'Johnny', 'address': {'@type': 'PostalAddress', 'addressLocality': 'Wonderland', 'addressRegion': 'Georgia', 'streetAddress': '1234 Peach Drive'}, 'affiliation': 'University of Dreams', 'jobTitle': 'Graduate research assistant', 'name': 'John Doe', 'url': 'http://www.example.com'}]
RDFa extraction (experimental)
>>> from pprint import pprint >>> from extruct.rdfa import RDFaExtractor # you can ignore the warning about html5lib not being available INFO:rdflib:RDFLib Version: 4.2.1 /home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available. 'parsers will not be available.') >>> >>> html = """<html> ... <head> ... ... ... </head> ... <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/"> ... <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting"> ... <h2 property="dc:title">The trouble with Bob</h2> ... ... ... <h3 property="dc:creator schema:creator" resource="#me">Alice</h3> ... <div property="schema:articleBody"> ... <p>The trouble with Bob is that he takes much better photos than I do:</p> ... </div> ... ... ... </div> ... </body> ... </html> ... """ >>> >>> rdfae = RDFaExtractor() >>> pprint( ... rdfae.extract(html, url='http://www.example.com/index.html') ... ) [{'@id': 'http://www.example.com/alice/posts/trouble_with_bob', '@type': ['http://schema.org/BlogPosting'], 'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}], 'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}], 'http://schema.org/articleBody': [{'@value': '\n' ' The trouble with Bob ' 'is that he takes much better ' 'photos than I do:\n' ' '}], 'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]
You’ll get a list of expanded JSON-LD nodes.
REST API service
extruct also ships with a REST API service to test its output from URLs.
Dependencies
Usage
python -m extruct.service
launches an HTTP server listening on port 10005.
Methods supported
/extruct/<URL> method = GET /extruct/batch method = POST params: urls - a list of URLs separted by newlines urlsfile - a file with one URL per line
E.g. http://localhost:10005/extruct/http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412
will output something like this:
{ "url":"http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412", "status":"ok", "microdata":[ { "type":"http://schema.org/Product", "properties":{ "name":"Susket", "color":[ "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412", "http://www.sarenza.com/i-love-shoes-susket-s767163-p0000119412" ], "brand":"http://www.sarenza.com/i-love-shoes", "aggregateRating":{ "type":"http://schema.org/AggregateRating", "properties":{ "description":"Soyez le premier \u00e0 donner votre avis" } }, "offers":{ "type":"http://schema.org/AggregateOffer", "properties":{ "lowPrice":"59,00 \u20ac", "price":"A partir de\r\n 59,00 \u20ac", "priceCurrency":"EUR", "highPrice":"59,00 \u20ac", "availability":"http://schema.org/InStock" } }, "size":[ "36 - Epuis\u00e9 - \u00catre alert\u00e9", "37 - Epuis\u00e9 - \u00catre alert\u00e9", "38 - Epuis\u00e9 - \u00catre alert\u00e9", "39 - Derni\u00e8re paire !", "40", "41", "42 - Derni\u00e8re paire !" ], "image":[ "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_09.jpg?201509221045", "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_03.jpg?201509221045", "http://cdn3.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_04.jpg?201509221045", "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_05.jpg?201509221045", "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_06.jpg?201509221045", "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_07.jpg?201509221045", "http://cdn1.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_08.jpg?201509221045", "http://cdn2.sarenza.net/static/_img/productsV4/0000119412/MD_0000119412_223992_02.jpg?201509291747" ], "description":"" } } ] }
Command Line Tool
extruct provides a command line tool that allows you to fetch a page and extract the metadata from it directly from the command line.
Dependencies
The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:
pip install extruct[cli]
Usage
extruct "http://example.com"
Downloads “http://example.com” and outputs the Microdata, JSON-LD and RDFa metadata to stdout.
Supported Parameters
By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD and RDFa). If you want to restrict the output to just one or a subset of those, you can use the individual switches.
For example, this command extracts only Microdata and JSON-LD metadata from “http://example.com”:
extruct --microdata --jsonld "http://example.com"
Development version
mkvirtualenv extruct pip install -r requirements-dev.txt
Tests
Run tests in current environment:
py.test tests
Use tox to run tests with different Python versions:
tox
Versioning
Use bumpversion to conveniently change project version:
bumpversion patch # 0.0.0 -> 0.0.1 bumpversion minor # 0.0.1 -> 0.1.0 bumpversion major # 0.1.0 -> 1.0.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file extruct-0.4.0.tar.gz
.
File metadata
- Download URL: extruct-0.4.0.tar.gz
- Upload date:
- Size: 11.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | da6678d6b3128ea57ced90b87e5549eeb81835aadc904b8d4a98592f4fc159a1 |
|
MD5 | c71e1aec773d96a6b0858e2d69d88520 |
|
BLAKE2b-256 | 896e10eb4d71ce9ef122859d174f096c647d012cf0d6effb556b4978e3fddbb1 |
File details
Details for the file extruct-0.4.0-py2.py3-none-any.whl
.
File metadata
- Download URL: extruct-0.4.0-py2.py3-none-any.whl
- Upload date:
- Size: 10.0 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 89c62671892ea8963850e63a11b7081cd1d4b008175abfc3f9f76067ed2f40b5 |
|
MD5 | 1e697caba6e707fdfe1108ef96622540 |
|
BLAKE2b-256 | fc160aec430185bdafd0a1486f8c286f155e07c7673425d632e0ffeb1e5aed1f |