parsel

Parsel is a library to extract data from HTML and XML using XPath and CSS selectors

Project description

Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with regular expressions.

Find the Parsel online documentation at https://parsel.readthedocs.org.

Example:

>>> from parsel import Selector
>>> selector = Selector(text=u"""<html>
        <body>
            <h1>Hello, Parsel!</h1>
            <ul>
                <li><a href="http://example.com">Link 1</a></li>
                <li><a href="http://scrapy.org">Link 2</a></li>
            </ul>
        </body>
        </html>""")
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
['Hello', 'Parsel']
>>> for li in selector.css('ul > li'):
...     print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org

History

1.6.0 (2020-05-07)

Python 3.4 is no longer supported
New Selector.remove() and SelectorList.remove() methods to remove selected elements from the parsed document tree
Improvements to error reporting, test coverage and documentation, and code cleanup

1.5.2 (2019-08-09)

Selector.remove_namespaces received a significant performance improvement
The value of data within the printable representation of a selector (repr(selector)) now ends in ... when truncated, to make the truncation obvious.
Minor documentation improvements.

1.5.1 (2018-10-25)

has-class XPath function handles newlines and other separators in class names properly;
fixed parsing of HTML documents with null bytes;
documentation improvements;
Python 3.7 tests are run on CI; other test improvements.

1.5.0 (2018-07-04)

New Selector.attrib and SelectorList.attrib properties which make it easier to get attributes of HTML elements.
CSS selectors became faster: compilation results are cached (LRU cache is used for css2xpath), so there is less overhead when the same CSS expression is used several times.
.get() and .getall() selector methods are documented and recommended over .extract_first() and .extract().
Various documentation tweaks and improvements.

One more change: .extract() and .extract_first() are now implemented on top of .get() and .getall(), not the other way around, and all other methods now call Selector.get internally instead of Selector.extract. This can be backwards incompatible for custom Selector subclasses that override Selector.extract without also overriding Selector.get. If you have such a subclass, make sure the get method is overridden as well. For example, this:

import parsel

class MySelector(parsel.Selector):
    def extract(self):
        return super().extract() + " foo"

should be changed to this:

import parsel

class MySelector(parsel.Selector):
    def get(self):
        return super().get() + " foo"
    # keep extract() as an alias so both names behave the same
    extract = get

1.4.0 (2018-02-08)

Selector and SelectorList can’t be pickled because pickling/unpickling doesn’t work for lxml.html.HtmlElement; parsel now raises TypeError explicitly instead of allowing pickle to silently produce wrong output. This is technically backwards-incompatible if you’re using Python < 3.6.

1.3.1 (2017-12-28)

Fix artifact uploads to PyPI.

1.3.0 (2017-12-28)

has-class XPath extension function;
parsel.xpathfuncs.set_xpathfunc is a simplified way to register XPath extensions;
Selector.remove_namespaces now removes namespace declarations;
Python 3.3 support is dropped;
make htmlview command for easier Parsel docs development.
CI: PyPy installation is fixed; parsel now runs tests for PyPy3 as well.

1.2.0 (2017-05-17)

Add SelectorList.get and SelectorList.getall methods as aliases for SelectorList.extract_first and SelectorList.extract respectively
Add default value parameter to SelectorList.re_first method
Add Selector.re_first method
Add replace_entities argument on .re() and .re_first() to turn off replacing of character entity references
Bug fix: detect None result from lxml parsing and fallback with an empty document
Rearrange XML/HTML examples in the selectors usage docs
Travis CI:
- Test against Python 3.6
- Test against PyPy using “Portable PyPy for Linux” distribution

1.1.0 (2016-11-22)

Change default HTML parser to lxml.html.HTMLParser, which makes it easier to use some HTML-specific features
Add css2xpath function to translate CSS to XPath
Add support for ad-hoc namespace declarations
Add support for XPath variables
Documentation improvements and updates

1.0.3 (2016-07-29)

Add BSD-3-Clause license file
Re-enable PyPy tests
Integrate py.test runs with setuptools (needed for Debian packaging)
Changelog is now called NEWS

1.0.2 (2016-04-26)

Fix bug in exception handling causing original traceback to be lost
Added docstrings and other doc fixes

1.0.1 (2015-08-24)

Updated PyPI classifiers
Added docstrings for csstranslator module and other doc fixes

1.0.0 (2015-08-22)

Documentation fixes

0.9.6 (2015-08-14)

Updated documentation
Extended test coverage

0.9.5 (2015-08-11)

Support for extending SelectorList

0.9.4 (2015-08-10)

Try workaround for travis-ci/dpl#253

0.9.3 (2015-08-07)

Add base_url argument

0.9.2 (2015-08-07)

Rename module unified to selector and promote root attribute
Add create_root_node function

0.9.1 (2015-08-04)

Setup Sphinx build and docs structure
Build universal wheels
Rename some leftovers from package extraction

0.9.0 (2015-07-30)

First release on PyPI.

Release history

1.9.1 (Apr 8, 2024)
1.9.0 (Mar 14, 2024)
1.8.1 (Apr 18, 2023)
1.7.0 (Nov 1, 2022)
1.6.0 (May 7, 2020), this version
1.5.2 (Aug 9, 2019)
1.5.1 (Oct 25, 2018)
1.5.0 (Jul 3, 2018)
1.4.0 (Feb 8, 2018)
1.3.1 (Dec 28, 2017)
1.2.0 (May 17, 2017)
1.1.0 (Nov 22, 2016)
1.0.3 (Jul 29, 2016)
1.0.2 (Apr 26, 2016)
1.0.1 (Aug 24, 2015)
1.0.0 (Aug 23, 2015)
0.9.6 (Aug 14, 2015)
0.9.5 (Aug 11, 2015)
0.9.4 (Aug 10, 2015)
0.9.3 (Aug 7, 2015)
0.9.2 (Aug 7, 2015)
0.9.1 (Aug 4, 2015)
0.9.0 (Jul 30, 2015)
0.1.0 (May 5, 2015)

Download files

Source Distribution

parsel-1.6.0.tar.gz (41.8 kB)

Uploaded May 7, 2020 Source

Built Distribution

parsel-1.6.0-py2.py3-none-any.whl (13.0 kB)

Uploaded May 7, 2020 Python 2 Python 3

File details

Details for the file parsel-1.6.0.tar.gz.

File metadata

Download URL: parsel-1.6.0.tar.gz
Upload date: May 7, 2020
Size: 41.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.1

File hashes

Hashes for parsel-1.6.0.tar.gz
Algorithm	Hash digest
SHA256	`70efef0b651a996cceebc69e55a85eb2233be0890959203ba7c3a03c72725c79`
MD5	`524b9519a20f401cd44f06d7f725c856`
BLAKE2b-256	`57208e7aef69de46de1c991d7880ffb6c046e0cb94ad41e20dcd6a74d02c1c1a`

File details

Details for the file parsel-1.6.0-py2.py3-none-any.whl.

File metadata

Download URL: parsel-1.6.0-py2.py3-none-any.whl
Upload date: May 7, 2020
Size: 13.0 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.7.1

File hashes

Hashes for parsel-1.6.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`9e1fa8db1c0b4a878bf34b35c043d89c9d1cbebc23b4d34dbc3c0ec33f2e087d`
MD5	`9916feba2af53946cd278e7f4d76df62`
BLAKE2b-256	`231e9b39d64cbab79d4362cdd7be7f5e9623d45c4a53b3f7522cd8210df52d8e`
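
One way to use the digests listed above is to compare them with the hash of a downloaded file. The sketch below uses only the standard library; the helper names sha256_of and matches_published_hash are hypothetical, and the expected value is the SHA256 digest published above for parsel-1.6.0.tar.gz.

```python
import hashlib

# SHA256 digest listed above for parsel-1.6.0.tar.gz.
EXPECTED_SHA256 = (
    "70efef0b651a996cceebc69e55a85eb2233be0890959203ba7c3a03c72725c79"
)


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Compare a local file's digest with the published one."""
    return sha256_of(path) == expected.lower()
```

For example, matches_published_hash("parsel-1.6.0.tar.gz") would return True only if the local file is byte-for-byte identical to the published source distribution.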