A fast DOM parser
Project description
This is a fork of Common Functions and ParseDOM for use outside of XBMC.
Getting element content.
from parsedom import parseDOM
link_html = "<a href='bla.html'>Link Test</a>"
ret = parseDOM(link_html, "a")
print(repr(ret))  # Prints ['Link Test']
Getting an element attribute.
link_html = "<a href='bla.html'>Link Test</a>"
ret = parseDOM(link_html, "a", ret = "href")
print(repr(ret))  # Prints ['bla.html']
Getting an element with a matching attribute.
link_html = "<a href='bla1.html' id='link1'>Link Test1</a><a href='bla2.html' id='link2'>Link Test2</a><a href='bla3.html' id='link3'>Link Test3</a>"
ret1 = parseDOM(link_html, "a", attrs = { "id": "link1" }, ret = "href")
ret2 = parseDOM(link_html, "a", attrs = { "id": "link2" })
ret3 = parseDOM(link_html, "a", attrs = { "id": "link3" }, ret = "id")
print(repr(ret1))  # Prints ['bla1.html']
print(repr(ret2))  # Prints ['Link Test2']
print(repr(ret3))  # Prints ['link3']
When scraping a site, it is prudent to work in steps, since real pages are often complicated.
Take this example, where you want to collect all of the user's uploads.
<div id="content">
<div id="sidebar">
<div id="latest">
<a href="/video?8wxOVn99FTE">Miley Cyrus - When I Look At You</a><br />
<a href="/video?46">Puppet theater</a><br />
<a href="/video?98">VBLOG #42</a><br />
<a href="/video?11">Fourth upload</a><br />
</div>
</div>
<div id="user">
<div id="uploads">
<a href="/video?12">First upload</a><br />
<a href="/video?23">Second upload</a><br />
<a href="/video?34">Third upload</a><br />
<a href="/video?41">Fourth upload</a><br />
</div>
</div>
</div>
The first step is to limit your search to the correct area.
Always find the innermost DOM element that contains the data you need.
ret = parseDOM(html, "div", attrs = { "id": "uploads" })
The variable ret now contains
['<a href="/video?12">First upload</a><br />
<a href="/video?23">Second upload</a><br />
<a href="/video?34">Third upload</a><br />
<a href="/video?41">Fourth upload</a><br />']
Now extract the video URLs from that fragment.
videos = parseDOM(ret, "a", ret = "href")
print(repr(videos))  # Prints ['/video?12', '/video?23', '/video?34', '/video?41']
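The same two-step idea can be tried without installing anything. The following is a minimal sketch using only the standard library's re module; it is not how parseDOM is implemented, and the helper names inner_html and link_targets are hypothetical. It assumes the target container holds no nested elements of the same tag, which is true for the uploads div above.

```python
import re

# The sample page from the example above.
PAGE = """
<div id="content">
<div id="sidebar">
<div id="latest">
<a href="/video?8wxOVn99FTE">Miley Cyrus - When I Look At You</a><br />
<a href="/video?46">Puppet theater</a><br />
<a href="/video?98">VBLOG #42</a><br />
<a href="/video?11">Fourth upload</a><br />
</div>
</div>
<div id="user">
<div id="uploads">
<a href="/video?12">First upload</a><br />
<a href="/video?23">Second upload</a><br />
<a href="/video?34">Third upload</a><br />
<a href="/video?41">Fourth upload</a><br />
</div>
</div>
</div>
"""


def inner_html(html, tag, element_id):
    """Return the inner HTML of each element of type `tag` with the given id.

    Simplification: matching stops at the first closing tag, so the
    target element must not contain nested elements of the same tag.
    """
    pattern = r'<{0}\b[^>]*\bid=["\']{1}["\'][^>]*>(.*?)</{0}>'.format(
        re.escape(tag), re.escape(element_id))
    return re.findall(pattern, html, re.DOTALL)


def link_targets(fragments):
    """Return the href value of every <a> tag found in a list of fragments."""
    hrefs = []
    for fragment in fragments:
        hrefs.extend(
            re.findall(r'<a\b[^>]*\bhref=["\']([^"\']*)["\']', fragment))
    return hrefs


# Step 1: narrow the search to the innermost container.
uploads = inner_html(PAGE, "div", "uploads")
# Step 2: extract the links from that fragment only.
videos = link_targets(uploads)
# For comparison: searching the whole page also picks up the sidebar links.
everything = link_targets([PAGE])

print(videos)           # ['/video?12', '/video?23', '/video?34', '/video?41']
print(len(everything))  # 8
```

Skipping the first step would return all eight links, including the four sidebar entries, one of which is also titled "Fourth upload".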
Download files
Source Distribution
parsedom-1.0.0.tar.gz (16.8 kB)
File details
Details for the file parsedom-1.0.0.tar.gz.
File metadata
- Download URL: parsedom-1.0.0.tar.gz
- Upload date:
- Size: 16.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 09c15a77c9115127d38b330bc8d688506d5282b4ed0aaa910604587f23ca43b8 |
MD5 | 4247bc3bab09a6166773cf55d398ce2c |
BLAKE2b-256 | b2cbdd97f8e212095cb947b40cbc3748d05a003b0bfe1ca85a1a4a24548305ae |
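The published digests can be used to check a downloaded archive. Below is a minimal sketch using the standard library's hashlib module; the function names file_sha256 and matches_published are hypothetical, and the local path of the archive is whatever you saved it as.

```python
import hashlib

# SHA256 digest published for parsedom-1.0.0.tar.gz in the table above.
PUBLISHED_SHA256 = (
    "09c15a77c9115127d38b330bc8d688506d5282b4ed0aaa910604587f23ca43b8"
)


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path):
    """Return True if the file at `path` has the published SHA256 digest."""
    return file_sha256(path) == PUBLISHED_SHA256


# Example call, assuming the archive was saved in the current directory:
# matches_published("parsedom-1.0.0.tar.gz")
```

SHA256 is the digest to prefer here; MD5 is listed for completeness but is not collision resistant.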