Skip to main content

Scrapy helper to create scrapers from models

Project description

Create scraper using Scrapy Selectors
============================================

allows you to select by CSS or by XPATH

Implemented in a Model approach, you create a Fetcher class and defines some fields which points to Xpath or Css selectors, those fields are fetched and an object populated with data.

Data can be normalized using ``parse_<field>`` methods.

### Instalation

easy to install

If running ubuntu maybe you need to run:

```
sudo apt-get install python-scrapy
sudo apt-get install libffi-dev
sudo apt-get install python-dev
```

then

```
pip install scrapy_model
```

or


```
git clone https://github.com/rochacbruno/scrapy_model
cd scrapy_model
pip install -r requirements.txt
python setup.py install
python example.py
```

Example code to fetch the url http://en.m.wikipedia.org/wiki/Guido_van_Rossum

```
#coding: utf-8

from scrapy_model import BaseFetcherModel, CSSField, XPathField


class TestFetcher(BaseFetcherModel):
photo_url = XPathField('//*[@id="content"]/div[1]/table/tr[2]/td/a')

nationality = CSSField(
'#content > div:nth-child(1) > table > tr:nth-child(4) > td > a',
)

links = CSSField(
'#content > div:nth-child(11) > ul > li > a.external::attr(href)',
auto_extract=True
)

def parse_photo_url(self, selector):
return "http://en.m.wikipedia.org/{}".format(
selector.xpath("@href").extract()[0]
)

def parse_nationality(self, selector):
return selector.css("::text").extract()[0]

def parse_name(self, selector):
return selector.extract()[0]

def post_parse(self):
# executed after all parsers
# you can load any data on to self._data
# access self._data and self._fields for current data
# self.selector contains original page
# self.fetch() returns original html
self._data.url = self.url


class DummyModel(object):
"""
For tests only, it can be a model in your database ORM
"""


if __name__ == "__main__":
from pprint import pprint

fetcher = TestFetcher(cache_fetch=True)
fetcher.url = "http://en.m.wikipedia.org/wiki/Guido_van_Rossum"

# Mappings can be loaded from a json file
# fetcher.load_mappings_from_file('path/to/file')
fetcher.mappings['name'] = {
"css": ("#section_0::text")
}

fetcher.parse()

print "Fetcher holds the data"
print fetcher._data.name
print fetcher._data

# How to populate an object
print "Populating an object"
dummy = DummyModel()

fetcher.populate(dummy, fields=["name", "nationality"])
# fields attr is optional
print dummy.nationality
pprint(dummy.__dict__)

```

# outputs


```
Fetcher holds the data
Guido van Rossum
{'links': [u'http://www.python.org/~guido/',
u'http://neopythonic.blogspot.com/',
u'http://www.artima.com/weblogs/index.jsp?blogger=guido',
u'http://python-history.blogspot.com/',
u'http://www.python.org/doc/essays/cp4e.html',
u'http://www.twit.tv/floss11',
u'http://www.computerworld.com.au/index.php/id;66665771',
u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',
u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],
'name': u'Guido van Rossum',
'nationality': u'Dutch',
'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',
'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}
Populating an object
Dutch
{'name': u'Guido van Rossum', 'nationality': u'Dutch'}
```

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy_model-0.1.0.tar.gz (4.4 kB view details)

Uploaded Source

Built Distribution

scrapy_model-0.1.0-py2.py3-none-any.whl (5.1 kB view details)

Uploaded Python 2 Python 3

File details

Details for the file scrapy_model-0.1.0.tar.gz.

File metadata

File hashes

Hashes for scrapy_model-0.1.0.tar.gz
Algorithm Hash digest
SHA256 25cd49d39582d91b8bd845bb4eb9a4f9ace91a471527f5055999fb2ab542c4d9
MD5 96d90b080fa7bc4214135ee72f5b511f
BLAKE2b-256 647aec1f9e3bbf2a61c8ea92f6ef5c17a1b54f268ba85cd92b4849a5f15733fb

See more details on using hashes here.

File details

Details for the file scrapy_model-0.1.0-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for scrapy_model-0.1.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 fa2272a2d0d7f4c152dfc3770808499fcac250ccd109d6501f604650eb25ecd5
MD5 96adcd259c9fb04412e5878d62f45ce4
BLAKE2b-256 b3bc2a0a94d56de7cde5c5b67cade6ced8789a1d6766340475c0f054ac97b7e5

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page