
Parsel is a library to extract data from HTML and XML using XPath and CSS selectors


Parsel is a BSD-licensed Python library to extract data from HTML, JSON, and XML documents.

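It supports:

  • CSS and XPath expressions for HTML and XML documents

  • JMESPath expressions for JSON documents

  • Regular expressions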

Find the Parsel online documentation at https://parsel.readthedocs.org.

Example:

>>> from parsel import Selector
>>> text = """
        <html>
            <body>
                <h1>Hello, Parsel!</h1>
                <ul>
                    <li><a href="http://example.com">Link 1</a></li>
                    <li><a href="http://scrapy.org">Link 2</a></li>
                </ul>
                <script type="application/json">{"a": ["b", "c"]}</script>
            </body>
        </html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org
>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']

History

1.9.0 (2024-03-14)

  • Now requires cssselect >= 1.2.0 (this minimum version was required since 1.8.0 but that wasn’t properly recorded)

  • Removed support for Python 3.7

  • Added support for Python 3.12 and PyPy 3.10

  • Fixed an exception when calling __str__ or __repr__ on some JSON selectors

  • Code formatted with black

  • CI fixes and improvements

1.8.1 (2023-04-18)

  • Remove a Sphinx reference from NEWS to fix the PyPI description

  • Add a twine check CI check to detect such problems

1.8.0 (2023-04-18)

  • Add support for JMESPath: you can now create a selector for a JSON document and call Selector.jmespath(). See the documentation for more information and examples.

  • Selectors can now be constructed from bytes (using the body and encoding arguments) instead of str (using the text argument), so that there is no internal conversion from str to bytes and the memory usage is lower.

  • Typing improvements

  • The pkg_resources module (which was absent from the requirements) is no longer used

  • Documentation build fixes

  • New requirements:

    • jmespath

    • typing_extensions (on Python 3.7)

1.7.0 (2022-11-01)

  • Add PEP 561-style type information

  • Support for Python 2.7, 3.5 and 3.6 is removed

  • Support for Python 3.9-3.11 is added

  • Very large documents (with deep nesting or long tag content) can now be parsed, and Selector now takes a new argument huge_tree to disable this

  • Support for new features of cssselect 1.2.0 is added

  • The Selector.remove() and SelectorList.remove() methods are deprecated and replaced with the new Selector.drop() and SelectorList.drop() methods which don’t delete text after the dropped elements when used in the HTML mode.

1.6.0 (2020-05-07)

  • Python 3.4 is no longer supported

  • New Selector.remove() and SelectorList.remove() methods to remove selected elements from the parsed document tree

  • Improvements to error reporting, test coverage and documentation, and code cleanup

1.5.2 (2019-08-09)

  • Selector.remove_namespaces received a significant performance improvement

  • The value of data within the printable representation of a selector (repr(selector)) now ends in ... when truncated, to make the truncation obvious.

  • Minor documentation improvements.

1.5.1 (2018-10-25)

  • has-class XPath function handles newlines and other separators in class names properly;

  • fixed parsing of HTML documents with null bytes;

  • documentation improvements;

  • Python 3.7 tests are run on CI; other test improvements.

1.5.0 (2018-07-04)

  • New Selector.attrib and SelectorList.attrib properties which make it easier to get attributes of HTML elements.

  • CSS selectors became faster: compilation results are cached (LRU cache is used for css2xpath), so there is less overhead when the same CSS expression is used several times.

  • .get() and .getall() selector methods are documented and recommended over .extract_first() and .extract().

  • Various documentation tweaks and improvements.

Additionally, .extract_first() and .extract() are now implemented using .get() and .getall(), not the other way around, and all other methods now call Selector.get internally instead of Selector.extract. This can be backwards incompatible for custom Selector subclasses which override Selector.extract without doing the same for Selector.get. If you have such a Selector subclass, make sure the get method is also overridden. For example, this:

import parsel

class MySelector(parsel.Selector):
    def extract(self):
        return super().extract() + " foo"

should be changed to this:

import parsel

class MySelector(parsel.Selector):
    def get(self):
        return super().get() + " foo"
    # Keep extract() as an alias so both names behave the same.
    extract = get

1.4.0 (2018-02-08)

  • Selector and SelectorList can’t be pickled because pickling/unpickling doesn’t work for lxml.html.HtmlElement; parsel now raises TypeError explicitly instead of allowing pickle to silently produce wrong output. This is technically backwards-incompatible if you’re using Python < 3.6.

1.3.1 (2017-12-28)

  • Fix artifact uploads to pypi.

1.3.0 (2017-12-28)

  • has-class XPath extension function;

  • parsel.xpathfuncs.set_xpathfunc is a simplified way to register XPath extensions;

  • Selector.remove_namespaces now removes namespace declarations;

  • Python 3.3 support is dropped;

  • make htmlview command for easier Parsel docs development.

  • CI: PyPy installation is fixed; parsel now runs tests for PyPy3 as well.

1.2.0 (2017-05-17)

  • Add SelectorList.get and SelectorList.getall methods as aliases for SelectorList.extract_first and SelectorList.extract respectively

  • Add default value parameter to SelectorList.re_first method

  • Add Selector.re_first method

  • Add replace_entities argument on .re() and .re_first() to turn off replacing of character entity references

  • Bug fix: detect None result from lxml parsing and fallback with an empty document

  • Rearrange XML/HTML examples in the selectors usage docs

  • Travis CI:

    • Test against Python 3.6

    • Test against PyPy using “Portable PyPy for Linux” distribution

1.1.0 (2016-11-22)

  • Change default HTML parser to lxml.html.HTMLParser, which makes it easier to use some HTML-specific features

  • Add css2xpath function to translate CSS to XPath

  • Add support for ad-hoc namespaces declarations

  • Add support for XPath variables

  • Documentation improvements and updates

1.0.3 (2016-07-29)

  • Add BSD-3-Clause license file

  • Re-enable PyPy tests

  • Integrate py.test runs with setuptools (needed for Debian packaging)

  • Changelog is now called NEWS

1.0.2 (2016-04-26)

  • Fix bug in exception handling causing original traceback to be lost

  • Added docstrings and other doc fixes

1.0.1 (2015-08-24)

  • Updated PyPI classifiers

  • Added docstrings for csstranslator module and other doc fixes

1.0.0 (2015-08-22)

  • Documentation fixes

0.9.6 (2015-08-14)

  • Updated documentation

  • Extended test coverage

0.9.5 (2015-08-11)

  • Support for extending SelectorList

0.9.4 (2015-08-10)

  • Try workaround for travis-ci/dpl#253

0.9.3 (2015-08-07)

  • Add base_url argument

0.9.2 (2015-08-07)

  • Rename module unified -> selector and promote root attribute

  • Add create_root_node function

0.9.1 (2015-08-04)

  • Setup Sphinx build and docs structure

  • Build universal wheels

  • Rename some leftovers from package extraction

0.9.0 (2015-07-30)

  • First release on PyPI.

