
Scrapy helper to create scrapers from models


Create scrapers using Scrapy Selectors
======================================

Allows you to select fields by CSS or by XPath.

Implemented in a model-based approach: you create a Fetcher class and define fields that point to XPath or CSS selectors; those fields are fetched and an object is populated with the data.

Data can be normalized using ``parse_<field>`` methods.
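
For example, a minimal fetcher might look like this (a sketch only; `MyFetcher` and the `h1.title::text` selector are illustrative, and a full runnable example follows below):

```
from scrapy_model import BaseFetcherModel, CSSField


class MyFetcher(BaseFetcherModel):
    # a field fetched with a CSS selector (illustrative selector)
    title = CSSField('h1.title::text')

    def parse_title(self, selector):
        # normalize the raw selector result into the final value
        return selector.extract()[0].strip()
```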

### Installation

Easy to install.

If you are running Ubuntu, you may first need to install these system dependencies:

```
sudo apt-get install python-scrapy
sudo apt-get install libffi-dev
sudo apt-get install python-dev
```

then

```
pip install scrapy_model
```

or


```
git clone https://github.com/rochacbruno/scrapy_model
cd scrapy_model
pip install -r requirements.txt
python setup.py install
python example.py
```

Example code to fetch the URL http://en.m.wikipedia.org/wiki/Guido_van_Rossum:

```
#coding: utf-8

from scrapy_model import BaseFetcherModel, CSSField, XPathField


class TestFetcher(BaseFetcherModel):
    photo_url = XPathField('//*[@id="content"]/div[1]/table/tr[2]/td/a')

    nationality = CSSField(
        '#content > div:nth-child(1) > table > tr:nth-child(4) > td > a',
    )

    links = CSSField(
        '#content > div:nth-child(11) > ul > li > a.external::attr(href)',
        auto_extract=True
    )

    def parse_photo_url(self, selector):
        return "http://en.m.wikipedia.org/{}".format(
            selector.xpath("@href").extract()[0]
        )

    def parse_nationality(self, selector):
        return selector.css("::text").extract()[0]

    def parse_name(self, selector):
        return selector.extract()[0]

    def post_parse(self):
        # executed after all parsers
        # you can load any data on to self._data
        # access self._data and self._fields for the current data
        # self.selector contains the original page selector
        # self.fetch() returns the original html
        self._data.url = self.url


class DummyModel(object):
    """
    For tests only; it could be a model from your database ORM
    """


if __name__ == "__main__":
    from pprint import pprint

    fetcher = TestFetcher(cache_fetch=True)
    fetcher.url = "http://en.m.wikipedia.org/wiki/Guido_van_Rossum"

    # Mappings can be loaded from a JSON file
    # fetcher.load_mappings_from_file('path/to/file')
    fetcher.mappings['name'] = {
        "css": "#section_0::text"
    }

    fetcher.parse()

    print("Fetcher holds the data")
    print(fetcher._data.name)
    print(fetcher._data)

    # How to populate an object
    print("Populating an object")
    dummy = DummyModel()

    fetcher.populate(dummy, fields=["name", "nationality"])
    # the fields argument is optional
    print(dummy.nationality)
    pprint(dummy.__dict__)
```

### Output


```
Fetcher holds the data
Guido van Rossum
{'links': [u'http://www.python.org/~guido/',
           u'http://neopythonic.blogspot.com/',
           u'http://www.artima.com/weblogs/index.jsp?blogger=guido',
           u'http://python-history.blogspot.com/',
           u'http://www.python.org/doc/essays/cp4e.html',
           u'http://www.twit.tv/floss11',
           u'http://www.computerworld.com.au/index.php/id;66665771',
           u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',
           u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],
 'name': u'Guido van Rossum',
 'nationality': u'Dutch',
 'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',
 'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}
Populating an object
Dutch
{'name': u'Guido van Rossum', 'nationality': u'Dutch'}
```
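
The JSON file accepted by `load_mappings_from_file` (commented out in the example above) is not documented here; presumably it mirrors the shape of the `fetcher.mappings` dict. A sketch under that assumption:

```
import json

# Assumed format: the same structure as the fetcher.mappings dict above.
with open("mappings.json", "w") as f:
    json.dump({"name": {"css": "#section_0::text"}}, f)

# then, instead of setting fetcher.mappings by hand:
# fetcher.load_mappings_from_file("mappings.json")
```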



### Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

- scrapy_model-0.1.1.tar.gz (4.8 kB), uploaded as source

Built Distribution

- scrapy_model-0.1.1-py2.py3-none-any.whl (5.1 kB), uploaded for Python 2 and Python 3

### File details

Hashes for scrapy_model-0.1.1.tar.gz:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 291c378abd72da5b79ad90a467a86f3121f1ba96f908db762ce14f8784f9b50c |
| MD5 | b143b8357c5ae28a96877b22e1615739 |
| BLAKE2b-256 | 4f1b68309b2f6455f0636a4484068f0601bc5faaa2be91484bb67679de35d44a |

Hashes for scrapy_model-0.1.1-py2.py3-none-any.whl:

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6f34ac4ac50422e5868f0572524aca364c7a318480bd7f7bf1d106cdbc922f60 |
| MD5 | 11f7ab7e3013ecc7d5e7ac1bf62a42c8 |
| BLAKE2b-256 | 36700ce739c1b0c6c33d50ccc1948bb1254dea841a50b716b01eed202b15ed35 |
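
As a quick sanity check after downloading, you can compare a file's SHA256 against the digest above using Python's standard library (the filename is assumed to be in the current directory):

```
import hashlib

# Published SHA256 for the sdist, copied from the table above.
EXPECTED = "291c378abd72da5b79ad90a467a86f3121f1ba96f908db762ce14f8784f9b50c"

with open("scrapy_model-0.1.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

print("OK" if digest == EXPECTED else "hash mismatch!")
```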
