Scrapy helper to create scrapers from models
Create scraper using Scrapy Selectors
============================================
scrapy_model lets you select page elements by CSS or by XPath.
It follows a model approach: you create a Fetcher class and define fields that point to XPath or CSS selectors. Those fields are fetched and an object is populated with the data.
Data can be normalized using ``parse_<field>`` methods.
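As a rough illustration of that convention, here is a minimal, self-contained sketch in plain Python. It is not scrapy_model's actual implementation: `TinyModel`, `PersonModel`, `Target`, and `raw_values` are hypothetical names, and the raw values are hard-coded instead of being fetched through selectors.

```python
class TinyModel(object):
    """Hypothetical stand-in for a fetcher: holds raw values per field."""

    def __init__(self, raw_values):
        # In a real fetcher these values would come from CSS/XPath selectors.
        self.raw_values = raw_values
        self.data = {}

    def parse(self):
        for name, raw in self.raw_values.items():
            # Use parse_<field> if the subclass defines it, else keep the raw value.
            normalize = getattr(self, "parse_" + name, None)
            self.data[name] = normalize(raw) if normalize else raw
        return self.data

    def populate(self, obj, fields=None):
        # Copy parsed values onto any object, optionally limited to some fields.
        for name in (fields or self.data):
            setattr(obj, name, self.data[name])
        return obj


class PersonModel(TinyModel):
    def parse_name(self, raw):
        return raw.strip()

    def parse_nationality(self, raw):
        return raw.strip().title()


class Target(object):
    """Plain object to be populated (could be an ORM model)."""


model = PersonModel({
    "name": "  Guido van Rossum ",
    "nationality": "dutch",
    "url": "http://example.com",
})
model.parse()
person = model.populate(Target(), fields=["name", "nationality"])
print("{} ({})".format(person.name, person.nationality))
```

The point of the pattern is that normalization lives next to the field it belongs to, and fields without a `parse_<field>` method pass through unchanged.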
### Installation
Installation is straightforward.
On Ubuntu, you may first need to run:
```
sudo apt-get install python-scrapy
sudo apt-get install libffi-dev
sudo apt-get install python-dev
```
Then install the package from PyPI:
```
pip install scrapy_model
```
Or install from source and run the bundled example:
```
git clone https://github.com/rochacbruno/scrapy_model
cd scrapy_model
pip install -r requirements.txt
python setup.py install
python example.py
```
Example code to fetch the URL http://en.m.wikipedia.org/wiki/Guido_van_Rossum:
```
# coding: utf-8

from scrapy_model import BaseFetcherModel, CSSField, XPathField


class TestFetcher(BaseFetcherModel):
    photo_url = XPathField('//*[@id="content"]/div[1]/table/tr[2]/td/a')

    nationality = CSSField(
        '#content > div:nth-child(1) > table > tr:nth-child(4) > td > a',
    )

    links = CSSField(
        '#content > div:nth-child(11) > ul > li > a.external::attr(href)',
        auto_extract=True
    )

    def parse_photo_url(self, selector):
        return "http://en.m.wikipedia.org/{}".format(
            selector.xpath("@href").extract()[0]
        )

    def parse_nationality(self, selector):
        return selector.css("::text").extract()[0]

    def parse_name(self, selector):
        return selector.extract()[0]

    def post_parse(self):
        # executed after all parsers
        # you can load any data on to self._data
        # access self._data and self._fields for current data
        # self.selector contains original page
        # self.fetch() returns original html
        self._data.url = self.url


class DummyModel(object):
    """
    For tests only, it can be a model in your database ORM
    """


if __name__ == "__main__":
    from pprint import pprint

    fetcher = TestFetcher(cache_fetch=True)
    fetcher.url = "http://en.m.wikipedia.org/wiki/Guido_van_Rossum"

    # Mappings can be loaded from a json file
    # fetcher.load_mappings_from_file('path/to/file')
    fetcher.mappings['name'] = {
        "css": ("#section_0::text")
    }

    fetcher.parse()

    # Single-argument print(...) behaves the same on Python 2 and Python 3
    print("Fetcher holds the data")
    print(fetcher._data.name)
    print(fetcher._data)

    # How to populate an object
    print("Populating an object")
    dummy = DummyModel()
    fetcher.populate(dummy, fields=["name", "nationality"])
    # fields attr is optional
    print(dummy.nationality)
    pprint(dummy.__dict__)
```
Output:
```
Fetcher holds the data
Guido van Rossum
{'links': [u'http://www.python.org/~guido/',
           u'http://neopythonic.blogspot.com/',
           u'http://www.artima.com/weblogs/index.jsp?blogger=guido',
           u'http://python-history.blogspot.com/',
           u'http://www.python.org/doc/essays/cp4e.html',
           u'http://www.twit.tv/floss11',
           u'http://www.computerworld.com.au/index.php/id;66665771',
           u'http://www.stanford.edu/class/ee380/Abstracts/081105.html',
           u'http://stanford-online.stanford.edu/courses/ee380/081105-ee380-300.asx'],
 'name': u'Guido van Rossum',
 'nationality': u'Dutch',
 'photo_url': 'http://en.m.wikipedia.org//wiki/File:Guido_van_Rossum_OSCON_2006.jpg',
 'url': 'http://en.m.wikipedia.org/wiki/Guido_van_Rossum'}
Populating an object
Dutch
{'name': u'Guido van Rossum', 'nationality': u'Dutch'}
```
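Note that `photo_url` in the output above contains a double slash (`org//wiki`): `parse_photo_url` formats an `href` that already starts with `/` onto a base that ends with `/`. One way to avoid this, sketched here with the standard library rather than anything specific to scrapy_model, is `urljoin`. The `href` value is copied from the output above, and `BASE_URL` is a name chosen for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

BASE_URL = "http://en.m.wikipedia.org/"
href = "/wiki/File:Guido_van_Rossum_OSCON_2006.jpg"

# String formatting keeps both slashes.
naive = "http://en.m.wikipedia.org/{}".format(href)

# urljoin resolves the absolute path against the base URL.
joined = urljoin(BASE_URL, href)

print(naive)
print(joined)
```

Inside the fetcher, `parse_photo_url` could likewise return `urljoin(self.url, selector.xpath("@href").extract()[0])`, assuming `self.url` has been set as in the example.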