Scrapy extension to write scraped items using Django models

scrapy-djangoitem is an extension that allows you to define Scrapy items using existing Django models.

This utility provides a new class, named DjangoItem, that you can use as a regular Scrapy item and link to a Django model through its django_model attribute. Start using it right away by importing it from this package:

from scrapy_djangoitem import DjangoItem

DjangoItem

DjangoItem is a class of item that gets its field definitions from a Django model. You simply create a DjangoItem and specify which Django model it relates to.

Besides defining the model fields on your item, DjangoItem provides a method to create and populate a Django model instance with the item data.

Using DjangoItem

DjangoItem works much like ModelForms in Django: you create a subclass and set its django_model attribute to a valid Django model. This gives you an item with a field for each Django model field.

In addition, you can define fields that aren’t present in the model, and even override fields that are present in the model by defining them in the item.

Let’s see some examples:

Creating a Django model for the examples:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=255)
    age = models.IntegerField()

Defining a basic DjangoItem:

from scrapy_djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person

DjangoItem works just like Scrapy items:

>>> p = PersonItem()
>>> p['name'] = 'John'
>>> p['age'] = '22'

To obtain the Django model instance from the item, we call the extra method DjangoItem.save():

>>> person = p.save()
>>> person.name
'John'
>>> person.age
'22'
>>> person.id
1

By default, the model instance is already saved to the database when we call DjangoItem.save(). To obtain an unsaved model instance instead, we call it with commit=False:

>>> person = p.save(commit=False)
>>> person.name
'John'
>>> person.age
'22'
>>> person.id
None
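In a Scrapy project, save() would typically be called from an item pipeline. The following is a minimal sketch, assuming every item reaching the pipeline is a DjangoItem. The class name DjangoWriterPipeline and the prepare hook are hypothetical and not part of scrapy-djangoitem:

```python
class DjangoWriterPipeline:
    """Item pipeline that persists each DjangoItem through its Django model."""

    def process_item(self, item, spider):
        # Build the model instance without writing it to the database yet.
        instance = item.save(commit=False)
        # Hypothetical hook: adjust the instance before it is written.
        self.prepare(instance, spider)
        # Write the model instance using the regular Django model API.
        instance.save()
        return item

    def prepare(self, instance, spider):
        """Override in a subclass to tweak the instance; does nothing by default."""
```

To take effect, a pipeline like this would need to be listed in the ITEM_PIPELINES setting of the Scrapy project.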

As said before, we can add other fields to the item:

import scrapy
from scrapy_djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person
    sex = scrapy.Field()

>>> p = PersonItem()
>>> p['name'] = 'John'
>>> p['age'] = '22'
>>> p['sex'] = 'M'

And we can override the fields of the model with our own:

class PersonItem(DjangoItem):
    django_model = Person
    name = scrapy.Field(default='No Name')

This is useful for attaching properties to the field, like a default or any other property that your project uses. Additional fields that are not part of the model won’t be taken into account when calling DjangoItem.save().
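As a rough mental model of the rules above, the sketch below uses plain dictionaries to show how declared item fields and model fields might combine, and which values would reach the model. It is a conceptual illustration only, not the library's code. The helper names merge_fields and model_kwargs are hypothetical:

```python
def merge_fields(model_field_names, declared_fields):
    """Combine fields declared on the item with fields taken from the model."""
    fields = dict(declared_fields)   # fields declared on the item take precedence
    for name in model_field_names:
        fields.setdefault(name, {})  # model fields only fill in the gaps
    return fields


def model_kwargs(values, model_field_names):
    """Keep only the item values that correspond to model fields."""
    return {k: v for k, v in values.items() if k in model_field_names}


model_field_names = ['name', 'age']
fields = merge_fields(model_field_names,
                      {'name': {'default': 'No Name'}, 'sex': {}})
kwargs = model_kwargs({'name': 'John', 'age': '22', 'sex': 'M'},
                      model_field_names)
```

Here fields ends up with the three keys name, age and sex, with name keeping the metadata declared on the item, while kwargs contains only name and age.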

DjangoItem caveats

DjangoItem is a rather convenient way to integrate Scrapy projects with Django models, but bear in mind that the Django ORM may not scale well if you scrape a lot of items (i.e. millions) with Scrapy. This is because a relational backend is often not a good choice for a write-intensive application (such as a web crawler), especially if the database is highly normalized and has many indices.
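One possible way to reduce the number of database round trips is to buffer unsaved instances and write them in batches. The sketch below is only an illustration under several assumptions: all items share a single Django model, the model exposes the default objects manager, and skipping the model's save() method and signals (as Django's bulk_create does) is acceptable. The class name BulkDjangoWriterPipeline and the batch size are hypothetical:

```python
class BulkDjangoWriterPipeline:
    """Buffers unsaved model instances and writes them in batches."""

    batch_size = 500  # arbitrary illustrative value, tune for your database

    def __init__(self):
        self.buffer = []

    def process_item(self, item, spider):
        # commit=False returns the model instance without saving it.
        self.buffer.append(item.save(commit=False))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def close_spider(self, spider):
        # Write whatever is left when the crawl ends.
        self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Assumption: every buffered instance belongs to the same model.
        model = type(self.buffer[0])
        model.objects.bulk_create(self.buffer)
        self.buffer = []
```

Batching does not remove the underlying cost of indices and normalization; it only cuts the per-item overhead of issuing one INSERT at a time.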

Django settings set up

To use the Django models outside the Django application, you need to set up the DJANGO_SETTINGS_MODULE environment variable and, in most cases, modify the PYTHONPATH environment variable to be able to import the settings module.

There are many ways to do this, depending on your use case and preferences. One of the simplest is detailed below.

Suppose your Django project is named mysite, is located at /home/projects/mysite, and contains an app myapp with the model Person. Your directory structure would then look something like this:

/home/projects/mysite
├── manage.py
├── myapp
│   ├── __init__.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
└── mysite
    ├── __init__.py
    ├── settings.py
    ├── urls.py
    └── wsgi.py

Then you need to add /home/projects/mysite to the PYTHONPATH environment variable and set the environment variable DJANGO_SETTINGS_MODULE to mysite.settings. That can be done in your Scrapy settings file by adding the lines below:

import sys
sys.path.append('/home/projects/mysite')

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

Notice that we modify the sys.path variable instead of the PYTHONPATH environment variable, as we are already within the Python runtime. If everything is right, you should be able to start the scrapy shell command and import the model Person (i.e. from myapp.models import Person).
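A slightly more defensive variant of the same setup is sketched below. It avoids adding the path twice and does not overwrite a DJANGO_SETTINGS_MODULE value that is already set in the environment. The helper name configure_django_env is hypothetical, and the path is the example location used above:

```python
import os
import sys

# Example location of the Django project; adjust to your own layout.
DJANGO_PROJECT_PATH = '/home/projects/mysite'


def configure_django_env(project_path, settings_module):
    """Make the Django settings module importable from the Scrapy process."""
    # Only extend sys.path once, even if this runs several times.
    if project_path not in sys.path:
        sys.path.append(project_path)
    # Keep any value already provided by the environment.
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', settings_module)


configure_django_env(DJANGO_PROJECT_PATH, 'mysite.settings')
```

On Django 1.7 and later, standalone use generally also requires calling django.setup() once the environment variable is in place and before importing any models. That step is outside the scope of the instructions above.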
