
Scrapy schema validation pipeline and Item builder using JSON Schema

Project description


This plugin provides two features based on JSON Schema and the jsonschema Python library:

- an item pipeline, JsonSchemaValidatePipeline, that validates scraped items against a JSON Schema and drops items that do not conform;
- an item base class, JsonSchemaItem, that builds a scrapy.Item definition from a JSON Schema.

Installation

Install scrapy-jsonschema using pip:

$ pip install scrapy-jsonschema

Configuration

Add JsonSchemaValidatePipeline by including it in ITEM_PIPELINES in your settings.py file:

ITEM_PIPELINES = {
    ...
    'scrapy_jsonschema.JsonSchemaValidatePipeline': 100,
}

Here, priority 100 is just an example; choose the value relative to the other pipelines you have enabled (pipelines run in ascending priority order).
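For example, to make sure invalid items are dropped before they reach a later pipeline, give the validation pipeline a lower priority value. In the sketch below, myproject.pipelines.StorePipeline is a hypothetical pipeline standing in for whatever you already have:

```python
# settings.py -- validation runs before storage, because item
# pipelines execute in ascending order of their priority values
ITEM_PIPELINES = {
    'scrapy_jsonschema.JsonSchemaValidatePipeline': 100,
    'myproject.pipelines.StorePipeline': 300,  # hypothetical example pipeline
}
```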

Usage

Let’s assume that you are working with the JSON Schema below, which describes products that each require a numeric ID, a name, and a positive price (this example is taken from the JSON Schema website):

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Product",
    "description": "A product from Acme's catalog",
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for a product",
            "type": "integer"
        },
        "name": {
            "description": "Name of the product",
            "type": "string"
        },
        "price": {
            "type": "number",
            "minimum": 0,
            "exclusiveMinimum": true
        }
    },
    "required": ["id", "name", "price"]
}
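If you want to experiment with this schema outside of Scrapy, the jsonschema library that the plugin builds on can validate documents directly. A quick sketch (schema abbreviated from the one above, descriptions omitted):

```python
from jsonschema import Draft4Validator

# The Product schema from above, as a Python dict
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Product",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0, "exclusiveMinimum": True},
    },
    "required": ["id", "name", "price"],
}

validator = Draft4Validator(schema)

# A conforming product produces no validation errors
assert list(validator.iter_errors({"id": 1, "name": "Widget", "price": 9.99})) == []

# A product with missing fields produces one error per violation
errors = sorted(e.message for e in validator.iter_errors({"name": "Widget"}))
print(errors)  # ["'id' is a required property", "'price' is a required property"]
```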

You can define a scrapy.Item from this schema by subclassing scrapy_jsonschema.item.JsonSchemaItem and setting its jsonschema class attribute to the schema. This attribute must be a Python dict — note that JSON’s true became True below; you can also use Python’s json module to load a JSON Schema from a string:

from scrapy_jsonschema.item import JsonSchemaItem


class ProductItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-04/schema#",
        "title": "Product",
        "description": "A product from Acme's catalog",
        "type": "object",
        "properties": {
            "id": {
                "description": "The unique identifier for a product",
                "type": "integer"
            },
            "name": {
                "description": "Name of the product",
                "type": "string"
            },
            "price": {
                "type": "number",
                "minimum": 0,
                "exclusiveMinimum": true
            }
        },
        "required": ["id", "name", "price"]
    }
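Since a JSON Schema is just JSON, you do not have to transcribe it into a Python dict by hand: the standard json module can parse it for you. A small sketch (schema trimmed from the example above):

```python
import json

# The Product schema as a raw JSON string; json.loads converts JSON
# literals to their Python equivalents (true -> True), so the result
# can be assigned directly to a JsonSchemaItem's jsonschema attribute.
PRODUCT_SCHEMA = """
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Product",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0, "exclusiveMinimum": true}
    },
    "required": ["id", "name", "price"]
}
"""

schema = json.loads(PRODUCT_SCHEMA)
print(schema["properties"]["price"]["exclusiveMinimum"])  # True
```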

You can then use this item class like any regular Scrapy item (note that assigning a field that is not in the schema raises a KeyError):

>>> item = ProductItem()
>>> item['foo'] = 3
(...)
KeyError: 'ProductItem does not support field: foo'

>>> item['name'] = 'Some name'
>>> item['name']
'Some name'

If you use this item definition in a spider and the pipeline is enabled, generated items that do not follow the schema will be dropped. In the (unrealistic) example spider below, one of the items only contains the “name”; “id” and “price” are missing:

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield ProductItem({
            "name": response.css('title::text').extract_first()
        })

        yield ProductItem({
            "id": 1,
            "name": response.css('title::text').extract_first(),
            "price": 9.99
        })

When you run this spider, the item with the missing fields is dropped, and you should see lines like these in the logs:

2017-01-20 12:34:23 [scrapy.core.scraper] WARNING: Dropped: schema validation failed:
 id: 'id' is a required property
price: 'price' is a required property

{'name': u'Example Domain'}

The second item conforms to the schema, so it is scraped and logged as usual:

2017-01-20 12:34:23 [scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/>
{'id': 1, 'name': u'Example Domain', 'price': 9.99}

The item pipeline also updates the Scrapy stats with a few counters, under the jsonschema/ namespace:

2017-01-20 12:34:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{...
 'item_dropped_count': 1,
 'item_dropped_reasons_count/DropItem': 1,
 'item_scraped_count': 1,
 'jsonschema/errors/id': 1,
 'jsonschema/errors/price': 1,
 ...}
2017-01-20 12:34:23 [scrapy.core.engine] INFO: Spider closed (finished)
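To see where counters like jsonschema/errors/id could come from, the sketch below maps jsonschema validation errors to field names and tallies them with Counter. This is an illustration using the jsonschema library directly, not the plugin’s actual code; the error_fields helper and its message-parsing heuristic are assumptions:

```python
from collections import Counter
from jsonschema import Draft4Validator

# Trimmed-down Product schema from the example above
schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["id", "name", "price"],
}

def error_fields(item):
    """Yield the field name each validation error concerns.

    For "required" errors the offending field is not in error.path,
    so it is pulled out of the message text; this heuristic is an
    assumption for illustration, not necessarily what the plugin does.
    """
    for error in Draft4Validator(schema).iter_errors(item):
        if error.path:                        # e.g. a type error on a present field
            yield str(error.path[0])
        else:                                 # e.g. a missing required property
            yield error.message.split("'")[1]

stats = Counter('jsonschema/errors/%s' % f
                for f in error_fields({'name': 'Example Domain'}))
print(dict(stats))  # one counter per missing field: id and price
```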

