HTML Parsing for Humans.
Requests-HTML: HTML Parsing for Humans™
This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.
When using this library you automatically get:
CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
XPath Selectors, for the faint at heart.
Mocked user-agent (like a real web browser).
Automatic following of redirects.
Connection pooling and cookie persistence.
The Requests experience you know and love, with magic parsing abilities.
Other nice features include:
Markdown export of pages and elements.
Usage
Make a GET request to ‘python.org’, using Requests:
>>> from requests_html import session
>>> r = session.get('https://python.org/')
Grab a list of all links on the page, as-is (anchors excluded):
>>> r.html.links
{'/users/membership/', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/about/success/', 'http://flask.pocoo.org/', 'http://www.djangoproject.com/', '/blogs/', ... '/psf-landing/', 'https://wiki.python.org/moin/PythonBooks'}
Grab a list of all links on the page, in absolute form (anchors excluded):
>>> r.html.absolute_links
{'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/downloads/mac-osx/', 'http://flask.pocoo.org/', 'https://www.python.org//docs.python.org/3/tutorial/', 'http://www.djangoproject.com/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org//docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/about/success/', 'http://twitter.com/ThePSF', 'https://www.python.org/events/python-user-group/634/', ..., 'https://wiki.python.org/moin/PythonBooks'}
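To make the difference between as-is and absolute links concrete, here is a minimal sketch of how relative hrefs can be resolved against a page URL using only the standard library. It is an illustration, not this library's implementation; `BASE_URL`, `make_absolute`, and the sample hrefs are assumptions chosen for the example.

```python
from urllib.parse import urljoin

# Assumed base URL for this illustration.
BASE_URL = "https://www.python.org/"


def make_absolute(base, hrefs):
    """Resolve every href against base and return the results as a set."""
    return {urljoin(base, href) for href in hrefs}


# Hypothetical sample of hrefs in the three common shapes.
hrefs = {
    "/about/success/",                # root-relative path
    "//docs.python.org/3/tutorial/",  # scheme-relative reference
    "http://flask.pocoo.org/",        # already absolute, left unchanged
}

absolute = make_absolute(BASE_URL, hrefs)
for url in sorted(absolute):
    print(url)
```

Note that `urljoin` treats an href starting with `//` as a reference to another host and borrows only the scheme from the base. Entries in the output above such as 'https://www.python.org//docs.python.org/3/tutorial/' appear to come from hrefs of that shape, so resolving them yourself along these lines may be worthwhile if you need the real target.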
Select an element with a jQuery selector:
>>> about = r.html.find('#about', first=True)
Grab an element’s text contents:
>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure
Introspect an Element’s attributes:
>>> about.attrs
{'id': 'about', 'class': 'tier-1 element-1 ', 'aria-haspopup': 'true'}
Select Elements within Elements:
>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]
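The idea of finding an element, reading its attributes, and then searching only inside it can be tried offline with the standard library. The sketch below uses `xml.etree.ElementTree` on a small, well-formed snippet; the snippet is a hypothetical stand-in for the real page, and ElementTree's path syntax is a limited XPath subset rather than the CSS selectors this library accepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for part of the page's navigation.
SNIPPET = """
<ul>
  <li id="about" class="tier-1 element-1" aria-haspopup="true">
    <a href="/about/" title="">About</a>
    <ul>
      <li><a href="/about/apps/" title="">Applications</a></li>
      <li><a href="/about/quotes/" title="">Quotes</a></li>
    </ul>
  </li>
  <li id="downloads"><a href="/downloads/">Downloads</a></li>
</ul>
"""

root = ET.fromstring(SNIPPET)

# Rough equivalent of find('#about', first=True): first element with that id.
about = root.find(".//*[@id='about']")

# Attributes come back as a plain mapping of name to value.
about_attrs = dict(about.attrib)

# Searching from `about` only visits its own descendants,
# so the '/downloads/' link is not included.
about_links = [a.get("href") for a in about.iter("a")]

print(about_attrs)
print(about_links)
```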
Render an Element as Markdown:
>>> print(about.markdown)
* [About](/about/)
* [Applications](/about/apps/)
* [Quotes](/about/quotes/)
* [Getting Started](/about/gettingstarted/)
* [Help](/about/help/)
* [Python Brochure](http://brochure.getpython.info/)
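For readers curious what such a conversion involves, here is a minimal sketch that turns the links in a small HTML list into Markdown bullet items. It only handles anchors and is not how this library produces its Markdown; `links_to_markdown` and the snippet are hypothetical names and data for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment resembling the '#about' menu.
SNIPPET = """<ul>
  <li><a href="/about/">About</a></li>
  <li><a href="/about/apps/">Applications</a></li>
  <li><a href="http://brochure.getpython.info/">Python Brochure</a></li>
</ul>"""


def links_to_markdown(fragment):
    """Return one Markdown bullet per <a> element, in document order."""
    root = ET.fromstring(fragment)
    lines = []
    for a in root.iter("a"):
        # Join all nested text so inline markup inside the link is kept.
        text = "".join(a.itertext()).strip()
        lines.append(f"* [{text}]({a.get('href')})")
    return "\n".join(lines)


markdown = links_to_markdown(SNIPPET)
print(markdown)
```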
Search for text on the page:
>>> r.html.search('Python is a {} language')[0]
programming
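The `{}` in the search template acts as a placeholder that captures whatever text sits in that position. A rough sketch of the idea using the standard `re` module follows; it is an assumption-laden illustration rather than this library's actual matching code, and `search_template` is a hypothetical helper.

```python
import re


def search_template(template, text):
    """Find the first match of template in text.

    Each '{}' in the template captures the shortest run of characters
    that lets the rest of the template match. Returns a tuple of the
    captured strings, or None when the template does not occur.
    """
    # Escape the literal pieces so punctuation is matched verbatim.
    literal_parts = [re.escape(part) for part in template.split("{}")]
    # Put a non-greedy capture group where each '{}' was.
    pattern = "(.+?)".join(literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Hypothetical page text for the example.
page_text = "Python is a programming language that lets you work quickly."
result = search_template("Python is a {} language", page_text)
print(result[0])
```

Because the capture is non-greedy, a placeholder at the very end of a template would capture only a single character in this sketch; a fuller implementation would need to handle that case.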
More complex CSS Selector example (copied from Chrome dev tools):
>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel)[0].text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.
XPath is also supported:
>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]
Installation
$ pipenv install requests-html
✨🍰✨
Download files
File details
Details for the file requests-html-0.1.1.tar.gz.
File metadata
- Download URL: requests-html-0.1.1.tar.gz
- Upload date:
- Size: 5.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | c08ddbc2a0a6b35b004c8788c10446607f523a2845311aef2767b195ac6f6dd2
MD5 | b0fb3b4dcb21672f1cfe154b5e6b8d34
BLAKE2b-256 | 724f514f4898c1f42fa997f0ebcfa431ddb731dbdf8165a58109214d3a992f8b
File details
Details for the file requests_html-0.1.1-py2.py3-none-any.whl.
File metadata
- Download URL: requests_html-0.1.1-py2.py3-none-any.whl
- Upload date:
- Size: 7.0 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | a506162347ccaefa6e67e9b02102f66abae1217231316d96b1b912da0c9948f0
MD5 | 43ef5e4f85fd19551007f9266a1101d3
BLAKE2b-256 | 6355ffdd439238cb4f7e17d1efa484151a172f8189e68e55fdb3c1a3b7179498