HTML Parsing for Humans.

Project description

Requests-HTML: HTML Parsing for Humans™

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

When using this library you automatically get:

  • CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).

  • XPath Selectors, for the faint at heart.

  • Mocked user-agent (like a real web browser).

  • Automatic following of redirects.

  • Connection pooling and cookie persistence.

  • The Requests experience you know and love, with magic parsing abilities.

Other nice features include:

  • Markdown export of pages and elements.

Usage

Make a GET request to ‘python.org’, using Requests:

>>> from requests_html import session
>>> r = session.get('https://python.org/')

Grab a list of all links on the page, as-is (anchors excluded):

>>> r.html.links
{'/users/membership/', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/about/success/', 'http://flask.pocoo.org/', 'http://www.djangoproject.com/', '/blogs/', ... '/psf-landing/', 'https://wiki.python.org/moin/PythonBooks'}
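
To make "as-is (anchors excluded)" concrete, here is a minimal standard-library sketch of the idea. It is not the library's implementation: it assumes "anchors" means in-page fragment links such as `#content`, and both the `LinkCollector` class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, skipping in-page anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()  # a set, so repeated links collapse to one entry

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = (dict(attrs).get("href") or "").strip()
        # Skip missing or empty hrefs and pure fragment links such as "#top".
        if href and not href.startswith("#"):
            self.links.add(href)


# Hypothetical sample markup, not taken from python.org.
sample = """
<ul>
  <li><a href="/about/">About</a></li>
  <li><a href="#content">Skip to content</a></li>
  <li><a href="http://flask.pocoo.org/">Flask</a></li>
  <li><a href="/about/">About (again)</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample)
print(sorted(collector.links))  # ['/about/', 'http://flask.pocoo.org/']
```

Relative links stay relative here, which matches the "as-is" output shown above.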

Grab a list of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/downloads/mac-osx/', 'http://flask.pocoo.org/', 'https://www.python.org//docs.python.org/3/tutorial/', 'http://www.djangoproject.com/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org//docs.python.org/3/tutorial/controlflow.html#defining-functions', 'https://www.python.org/about/success/', 'http://twitter.com/ThePSF', 'https://www.python.org/events/python-user-group/634/', ..., 'https://wiki.python.org/moin/PythonBooks'}
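
The sample output above contains entries such as `https://www.python.org//docs.python.org/3/tutorial/`, which look like protocol-relative links (`//docs.python.org/...`) appended to the base URL. If you need standard URL resolution for such links, one option is to resolve the as-is links yourself with `urllib.parse.urljoin`. This is a sketch only: the base URL and the three sample links are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# Assumed base URL (the final URL of the response after redirects).
base = "https://www.python.org/"

# Three kinds of link: root-relative, protocol-relative, already absolute.
links = {
    "/about/success/",
    "//docs.python.org/3/tutorial/",
    "http://flask.pocoo.org/",
}

absolute = {urljoin(base, link) for link in links}
for url in sorted(absolute):
    print(url)
# http://flask.pocoo.org/
# https://docs.python.org/3/tutorial/
# https://www.python.org/about/success/
```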

Select an element with a jQuery-style (CSS) selector:

>>> about = r.html.find('#about', first=True)

Grab an element’s text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element’s attributes:

>>> about.attrs
{'id': 'about', 'class': 'tier-1 element-1  ', 'aria-haspopup': 'true'}
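
In the output above, the `class` value is a single string with trailing whitespace rather than a list of class names. A small sketch of working with a dictionary shaped like that one (the dictionary is copied from the output; the variable names are illustrative):

```python
# Attribute dictionary copied from the example output above.
attrs = {'id': 'about', 'class': 'tier-1 element-1  ', 'aria-haspopup': 'true'}

# str.split() with no arguments discards the extra whitespace.
classes = attrs.get('class', '').split()
print(classes)  # ['tier-1', 'element-1']

# Attribute values are strings, so compare against 'true', not True.
has_popup = attrs.get('aria-haspopup') == 'true'
print(has_popup)  # True
```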

Select Elements within Elements:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]

Render an Element as Markdown:

>>> print(about.markdown)

* [About](/about/)

  * [Applications](/about/apps/)
  * [Quotes](/about/quotes/)
  * [Getting Started](/about/gettingstarted/)
  * [Help](/about/help/)
  * [Python Brochure](http://brochure.getpython.info/)

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
programming
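
The `{}` in the template acts as a placeholder for the text to capture. The sketch below imitates that behaviour with the standard `re` module, purely to illustrate the idea. It is not how the library does it, and `search_template` and `page_text` are hypothetical names.

```python
import re


def search_template(template, text):
    """Return the text captured by each {} placeholder, or None if no match."""
    # Escape the literal pieces and join them with non-greedy capture groups.
    literal_parts = template.split("{}")
    pattern = "(.+?)".join(re.escape(part) for part in literal_parts)
    match = re.search(pattern, text)
    return match.groups() if match else None


# Hypothetical page text used only for this illustration.
page_text = "Python is a programming language that lets you work quickly."

result = search_template("Python is a {} language", page_text)
print(result[0])  # programming
```

In this sketch a placeholder at the very end of the template captures only a single character, because the group is non-greedy and nothing follows it.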

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'

>>> print(r.html.find(sel)[0].text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported:

>>> r.html.xpath('a')
[<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]
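
A relative XPath expression such as `'a'` matches only direct children of the context node, which may be why the example above returns a single link; a descendant expression matches at any depth. The sketch below shows the difference using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and is used here as a stand-in. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
doc = ET.fromstring(
    "<body>"
    "<a href='/top'>top-level link</a>"
    "<div><a href='/nested'>nested link</a></div>"
    "</body>"
)

# 'a' selects only <a> elements that are direct children of <body>.
direct = [a.get("href") for a in doc.findall("a")]

# './/a' selects <a> elements at any depth below <body>.
anywhere = [a.get("href") for a in doc.findall(".//a")]

print(direct)    # ['/top']
print(anywhere)  # ['/top', '/nested']
```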

Installation

$ pipenv install requests-html
✨🍰✨

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

requests-html-0.1.1.tar.gz (5.4 kB)

Uploaded Source

Built Distribution

requests_html-0.1.1-py2.py3-none-any.whl (7.0 kB)

Uploaded Python 2 Python 3

File details

Details for the file requests-html-0.1.1.tar.gz.

File hashes

Hashes for requests-html-0.1.1.tar.gz
Algorithm Hash digest
SHA256 c08ddbc2a0a6b35b004c8788c10446607f523a2845311aef2767b195ac6f6dd2
MD5 b0fb3b4dcb21672f1cfe154b5e6b8d34
BLAKE2b-256 724f514f4898c1f42fa997f0ebcfa431ddb731dbdf8165a58109214d3a992f8b

File details

Details for the file requests_html-0.1.1-py2.py3-none-any.whl.

File hashes

Hashes for requests_html-0.1.1-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 a506162347ccaefa6e67e9b02102f66abae1217231316d96b1b912da0c9948f0
MD5 43ef5e4f85fd19551007f9266a1101d3
BLAKE2b-256 6355ffdd439238cb4f7e17d1efa484151a172f8189e68e55fdb3c1a3b7179498
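
One way to use the digests listed above is to compare a downloaded file's SHA256 against the published value. The following is a minimal sketch with the standard `hashlib` module; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path in the usage comment is an assumption.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the sdist was saved to the current directory:
# matches_published_hash(
#     "requests-html-0.1.1.tar.gz",
#     "c08ddbc2a0a6b35b004c8788c10446607f523a2845311aef2767b195ac6f6dd2",
# )
```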
