scrapely

A pure-Python HTML screen-scraping library

Project description

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

How does Scrapely relate to Scrapy?

Despite the similarity in their names, Scrapely and Scrapy are quite different things. The only things they have in common are that both depend on w3lib and both are maintained by the same group of developers (which is why both are hosted on the same GitHub account).

Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. If anything, Scrapely is more similar to BeautifulSoup or lxml than Scrapy.

Scrapely doesn’t depend on Scrapy, nor does Scrapy depend on Scrapely. In fact, it is quite common to use Scrapy without Scrapely, and vice versa.

If you are looking for a complete crawler-scraper solution, there is (at least) one project called Slybot that integrates both, but you can definitely use Scrapely with other web crawlers, since it’s just a library.

Scrapy has a built-in extraction mechanism called selectors which, unlike Scrapely, is based on XPath.

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows is a quick example of the simplest possible usage, which you can run in a Python shell.

Start by importing and instantiating the Scraper class:

>>> from scrapely import Scraper
>>> s = Scraper()

Then train the scraper by giving it a page and the data you expect to scrape from it (note that all keys and values in the data you pass must be strings):

>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)
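
The string-only requirement noted above can be checked before calling train(). The following is a minimal sketch; check_training_data is a hypothetical helper written for this example and is not part of the Scrapely API.

```python
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)  # Python 3


def check_training_data(data):
    """Raise TypeError unless every key and value in `data` is a string.

    Hypothetical helper for illustration; Scrapely does not provide it.
    """
    for key, value in data.items():
        if not isinstance(key, string_types):
            raise TypeError("field name %r is not a string" % (key,))
        if not isinstance(value, string_types):
            raise TypeError("value %r for field %r is not a string" % (value, key))
    return data


# A version number given as a float would be rejected; pass '1.1' instead.
data = check_training_data({'name': 'w3lib 1.1', 'author': 'Scrapy project'})
```

Because the helper returns its argument unchanged, it could wrap the dictionary passed to train() directly.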

Finally, tell the scraper to scrape any other similar page and it will return the results:

>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation &lt;foundation at djangoproject com&gt;'],
  u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
  u'name': [u'Django 1.3']}]

That’s it! No XPath expressions, regular expressions, or hacky Python code.
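
As the output above shows, scrape() returns a list of records, each field maps to a list of strings, and HTML entities such as &lt; are left as they appear in the page. If you only want the first value of each field with entities decoded, something like the following sketch could be applied to the result. first_values is a hypothetical helper, not part of Scrapely, and the sample record is copied from the example output (shortened to two fields).

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape


def first_values(records):
    """Keep the first extracted value of each field and decode HTML entities.

    Hypothetical post-processing helper; fields with no values are dropped.
    """
    return [
        dict((field, unescape(values[0]))
             for field, values in record.items() if values)
        for record in records
    ]


# Sample result copied from the scrape() output shown earlier.
records = [{
    'author': ['Django Software Foundation &lt;foundation at djangoproject com&gt;'],
    'name': ['Django 1.3'],
}]
flat = first_values(records)
print(flat[0]['author'])
# prints: Django Software Foundation <foundation at djangoproject com>
```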

Usage (command line tool)

There is also a simple script to create and manage Scrapely scrapers.

It supports both a command-line interface and an interactive prompt. All commands supported at the interactive prompt are also supported in the command-line interface.

To enter the interactive prompt, run the tool with the scraper file as its only argument:

python -m scrapely.tool myscraper.json

Example:

$ python -m scrapely.tool myscraper.json
scrapely> help

Documented commands (type help <topic>):
========================================
a  al  s  ta  td  tl

scrapely>

To create a scraper and add a template:

scrapely> ta http://pypi.python.org/pypi/w3lib
[0] http://pypi.python.org/pypi/w3lib

This is equivalent to running the following as a single command:

python -m scrapely.tool myscraper.json ta http://pypi.python.org/pypi/w3lib
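
Since every prompt command can also be given on the command line, the tool can be driven from a script. The sketch below only builds the argument list in the form shown above; tool_command is a hypothetical helper, and actually running the command assumes scrapely is installed for the current interpreter.

```python
import sys


def tool_command(scraper_file, *command):
    """Return the argv list for one non-interactive scrapely.tool invocation.

    `command` is the prompt command and its arguments, e.g. ("ta", url).
    """
    return [sys.executable, "-m", "scrapely.tool", scraper_file] + list(command)


argv = tool_command("myscraper.json", "ta", "http://pypi.python.org/pypi/w3lib")

# To execute it (requires scrapely to be installed):
#     import subprocess
#     subprocess.call(argv)
```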

To list available templates from a scraper:

scrapely> tl
[0] http://pypi.python.org/pypi/w3lib

To add a new annotation, you usually test the selection criteria first:

scrapely> t 0 w3lib 1.1
[0] u'<a href="/pypi/w3lib/1.1">w3lib 1.1</a>'
[1] u'<h1>w3lib 1.1</h1>'
[2] u'<title>Python Package Index : w3lib 1.1</title>'

You can also quote the text if you need to preserve an exact number of spaces, for example:

scrapely> t 0 "w3lib 1.1"

You can refine by position. To take the one in position [1]:

scrapely> a 0 w3lib 1.1 -n 1
[0] u'<h1>w3lib 1.1</h1>'

To annotate some fields on the template:

scrapely> a 0 w3lib 1.1 -n 1 -f name
[new] (name) u'<h1>w3lib 1.1</h1>'
scrapely> a 0 Scrapy project -n 0 -f author
[new] u'<span>Scrapy project</span>'

To list annotations on a template:

scrapely> al 0
[0-0] (name) u'<h1>w3lib 1.1</h1>'
[0-1] (author) u'<span>Scrapy project</span>'

To scrape another similar page with the already added templates:

scrapely> s http://pypi.python.org/pypi/Django/1.3
[{u'author': [u'Django Software Foundation'], u'name': [u'Django 1.3']}]

Requirements

Python 2.6 or 2.7
numpy
w3lib

A couple of notes regarding dependencies:

Scrapely does not depend on Scrapy in any way
Python 3 is not supported yet (pull requests welcome!)

Additional requirements for running tests:

Installation

To install scrapely on any platform, use:

pip install scrapely

If you’re using Ubuntu (9.10 or above), you can install scrapely from the Scrapy Ubuntu repos. Just add the Ubuntu repos as described here: http://doc.scrapy.org/en/latest/topics/ubuntu.html

And then install scrapely with:

aptitude install python-scrapely

Tests

tox is the preferred way to run tests. Just run tox from the root directory.

Support

Mailing list: https://groups.google.com/forum/#!forum/scrapely
IRC: scrapy@freenode

Scrapely is created and maintained by the Scrapy group, so you can get help through the usual support channels described in the Scrapy community page.

Architecture

Unlike most scraping libraries, Scrapely doesn’t work with DOM trees or XPath expressions, so it doesn’t depend on libraries such as lxml or libxml2. Instead, it uses an internal pure-Python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.
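
To make the idea of an array of token ids concrete, here is a toy illustration. It is not Scrapely's parser: the regular expression, the tokenize function and the vocabulary dictionary are all assumptions made for this example. It only shows how markup, including markup with unclosed tags, can be reduced to a flat sequence of integers without building a DOM tree.

```python
import re

# Split markup into tags and runs of non-whitespace text.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html, vocabulary):
    """Map each token of `html` to an integer id, growing `vocabulary` as needed."""
    ids = []
    for token in TOKEN_RE.findall(html):
        # A token seen before keeps its id; a new token gets the next free id.
        ids.append(vocabulary.setdefault(token, len(vocabulary)))
    return ids


vocabulary = {}
# The <p> and <b> tags are never closed, which is no obstacle for a flat tokenizer.
page = tokenize("<h1>w3lib 1.1</h1><p>unclosed <b>bold", vocabulary)
print(page)
# prints: [0, 1, 2, 3, 4, 5, 6, 7]
```

Tokenizing a second page with the same vocabulary gives shared tags the same ids, which is what makes the two arrays comparable.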

Scrapely extraction is based on the Instance Based Learning algorithm [1]. The matched items are combined into complex objects (nested and repeated objects are supported) using a tree of parsers, inspired by A Hierarchical Approach to Wrapper Induction [2].
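
The sketch below is a toy illustration of matching by example, loosely in the spirit of the approach described above. It is not Scrapely's extraction code: the functions, the scoring rule (longest run of identical tokens next to the region) and the sample token lists are all assumptions. Given a template with an annotated token range, it looks for the region of a new page whose surrounding tokens most resemble those around the annotation.

```python
def common_prefix_len(a, b):
    """Number of leading positions at which sequences a and b agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def locate(template, start, end, page):
    """Return (i, j) so that page[i:j] sits in the context most like template[start:end]."""
    before = template[:start][::-1]  # tokens preceding the annotation, nearest first
    after = template[end:]           # tokens following the annotation
    # Best start: the position whose preceding tokens best match `before`.
    i = max(range(len(page) + 1),
            key=lambda k: common_prefix_len(before, page[:k][::-1]))
    # Best end at or after i: the position whose following tokens best match `after`.
    j = max(range(i, len(page) + 1),
            key=lambda k: common_prefix_len(after, page[k:]))
    return i, j


# Tokens are shown as strings for readability; integer ids would work the same way.
template = ["<h1>", "w3lib", "1.1", "</h1>", "<span>", "Scrapy", "</span>"]
page = ["<h1>", "Django", "1.3", "</h1>", "<span>", "DSF", "</span>"]

i, j = locate(template, 1, 3, page)  # the annotation covers "w3lib 1.1"
print(page[i:j])
# prints: ['Django', '1.3']
```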

Known Issues

The training implementation is currently very simple and is only provided for reference purposes, to make it easier to test Scrapely and play with it. The extraction code, on the other hand, is reliable and production-ready. So, if you want to use Scrapely in production, use train() with caution and make sure it annotates the area of the page you intended.

Alternatively, you can use the Scrapely command line tool to annotate pages, which provides more manual control for higher accuracy.

License

The Scrapely library is licensed under the BSD license.

Release history

0.14.1: Nov 28, 2019
0.14.0: Jun 18, 2019
0.13.6: Nov 28, 2019
0.13.5: Jun 18, 2019
0.13.4: May 26, 2017
0.13.3: Jan 27, 2017
0.13.2: Dec 21, 2016
0.13.1: Dec 21, 2016
0.13.0: Dec 21, 2016
0.12.0: Jan 26, 2015
0.11.0: Aug 1, 2014 (this version)
0.10: Jan 14, 2014
0.9: Apr 19, 2011

Download files

Download the file for your platform.

Source Distribution

scrapely-0.11.0.tar.gz (29.2 kB)

Uploaded Aug 1, 2014 Source

Built Distribution

scrapely-0.11.0-py2-none-any.whl (31.8 kB)

Uploaded Aug 1, 2014 Python 2

File details

Details for the file scrapely-0.11.0.tar.gz.

File metadata

Download URL: scrapely-0.11.0.tar.gz
Upload date: Aug 1, 2014
Size: 29.2 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for scrapely-0.11.0.tar.gz
Algorithm	Hash digest
SHA256	`6d7518c4acb270cf6116a1ccc65a2ff3eb311e6b65bdb3fd71dff98d09a7b17e`
MD5	`bd08fd66f4384c9894fc2b1e89ca94a6`
BLAKE2b-256	`eea8a6b8886fff2846049f6cd49febcccab6825feb5d0d330bae5257ffa29080`
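
A downloaded file can be checked against the digests listed above. The following is a minimal sketch using only the standard library hashlib module; sha256_of and matches_published_hash are hypothetical helper names, and the local file path is whatever you saved the download as.

```python
import hashlib

# SHA256 digest published above for scrapely-0.11.0.tar.gz.
EXPECTED = "6d7518c4acb270cf6116a1ccc65a2ff3eb311e6b65bdb3fd71dff98d09a7b17e"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED):
    """True if the file's SHA256 digest equals the published one."""
    return sha256_of(path) == expected
```

For example, matches_published_hash("scrapely-0.11.0.tar.gz") should be true for an intact copy of the source distribution.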

File details

Details for the file scrapely-0.11.0-py2-none-any.whl.

File metadata

Download URL: scrapely-0.11.0-py2-none-any.whl
Upload date: Aug 1, 2014
Size: 31.8 kB
Tags: Python 2
Uploaded using Trusted Publishing? No

File hashes

Hashes for scrapely-0.11.0-py2-none-any.whl
Algorithm	Hash digest
SHA256	`e2c36bf25e4db6286e9a08339f61969a07dc72209acc4b9950ea0a6be4acc0d7`
MD5	`0ea2357c6ceba768addd7e03ca7d9172`
BLAKE2b-256	`2c1ac89ba5e74703727bfc63a34579eb5efe7b4f488ee0e522a4c264cb92537e`
