Pyppeteer integration for Scrapy

This project provides a Scrapy Download Handler which performs requests using Pyppeteer. It can be used to handle pages that require JavaScript. This package does not interfere with regular Scrapy workflows such as request scheduling or item processing.

Motivation

With the release of version 2.0, which includes partial coroutine syntax support and experimental asyncio support, Scrapy allows the integration of asyncio-based projects such as Pyppeteer.

Requirements

  • Python 3.6+
  • Scrapy 2.0+
  • Pyppeteer 0.0.23+

Installation

$ pip install scrapy-pyppeteer

Configuration

Replace the default http and https Download Handlers through DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_pyppeteer.ScrapyPyppeteerDownloadHandler",
    "https": "scrapy_pyppeteer.ScrapyPyppeteerDownloadHandler",
}

Note that the ScrapyPyppeteerDownloadHandler class inherits from the default http/https handler, and it will only use Pyppeteer for requests that are explicitly marked (see the "Basic usage" section for details).

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

scrapy-pyppeteer accepts the following settings:

  • PYPPETEER_LAUNCH_OPTIONS (type dict, default {})

    A dictionary with options to be passed when launching the Browser. See the docs for pyppeteer.launcher.launch.

  • PYPPETEER_NAVIGATION_TIMEOUT (type Optional[int], default None)

    The timeout used when requesting pages with Pyppeteer. If None or unset, the default value will be used (30000 ms at the time of writing). See the docs for pyppeteer.page.Page.setDefaultNavigationTimeout.
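
For example, a minimal sketch of how these two settings might look in a project's settings.py (the headless and args keys are standard pyppeteer.launcher.launch options; the values shown are purely illustrative):

# settings.py (illustrative values)

# Passed verbatim to pyppeteer.launcher.launch when the Browser is started
PYPPETEER_LAUNCH_OPTIONS = {
    "headless": True,
    "args": ["--no-sandbox"],
}

# Navigation timeout in milliseconds; leave unset (or None) to keep Pyppeteer's default
PYPPETEER_NAVIGATION_TIMEOUT = 60000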

Basic usage

Set the pyppeteer Request.meta key to download a request using Pyppeteer:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"pyppeteer": True})

    def parse(self, response):
        return response.follow_all(css="a", meta={"pyppeteer": True})

Receiving the Page object in the callback

Specifying pyppeteer.page.Page as the type for a callback argument will result in the corresponding Page object being injected into the callback. In order to be able to await coroutines on the provided Page object, the callback needs to be defined as a coroutine function (async def).

import scrapy
import pyppeteer

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page"

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"pyppeteer": True})

    async def parse(self, response, page: pyppeteer.page.Page):
        title = await page.title()  # "Example Domain"
        yield {"title": title}
        await page.close()

Notes:

  • In order to avoid memory issues, it is recommended to manually close the page by awaiting the Page.close coroutine (see the sketch after this list).
  • Any network operations resulting from awaiting a coroutine on a Page object (goto, goBack, etc) will be executed directly by Pyppeteer, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc).
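
As an illustrative sketch of the first note (the spider name and yielded item are hypothetical), the page can be closed in a finally block so that it is released even if processing the response fails:

import scrapy
import pyppeteer

class CarefulSpider(scrapy.Spider):
    name = "careful"

    def start_requests(self):
        yield scrapy.Request("https://example.org", meta={"pyppeteer": True})

    async def parse(self, response, page: pyppeteer.page.Page):
        try:
            # Work with the page while it is still open
            yield {"title": await page.title()}
        finally:
            # Always close the page to release browser resources
            await page.close()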

Page coroutines

A sorted iterable (a list or tuple, for instance) can be passed in the pyppeteer_page_coroutines Request.meta key to request coroutines to be awaited on the Page before returning the final Response to the callback.

This is useful when you need to perform certain actions on a page, like scrolling down or clicking links, and you want everything to count as a single Scrapy Response, containing the final result.

Supported actions

  • scrapy_pyppeteer.page.PageCoroutine(method: str, *args, **kwargs):

    Represents a coroutine to be awaited on a pyppeteer.page.Page object, such as "click", "screenshot", "evaluate", etc. method should be the name of the coroutine, *args and **kwargs are passed to the function call.

    For instance,

    PageCoroutine("screenshot", options={"path": "quotes.png", "fullPage": True})
    

    produces the same effect as:

    # 'page' is a pyppeteer.page.Page object
    await page.screenshot(options={"path": "quotes.png", "fullPage": True})
    
  • scrapy_pyppeteer.page.NavigationPageCoroutine(method: str, *args, **kwargs):

    Subclass of PageCoroutine. It waits for a navigation event: use this when you know a coroutine will trigger a navigation event, for instance when clicking on a link. This forces a Page.waitForNavigation() call wrapped in asyncio.gather, as recommended in the Pyppeteer docs.

    For instance,

    NavigationPageCoroutine("click", selector="a")
    

    produces the same effect as:

    # 'page' is a pyppeteer.page.Page object
    await asyncio.gather(
        page.waitForNavigation(),
        page.click(selector="a"),
    )
    

Examples

Click on a link, save the resulting page as PDF

import scrapy
from scrapy import Request
from scrapy_pyppeteer.page import NavigationPageCoroutine, PageCoroutine


class ClickAndSavePdfSpider(scrapy.Spider):
    name = "pdf"

    def start_requests(self):
        yield Request(
            url="https://example.org",
            meta=dict(
                pyppeteer=True,
                pyppeteer_page_coroutines=[
                    NavigationPageCoroutine("click", selector="a"),
                    PageCoroutine("pdf", options={"path": "iana.pdf"}),
                ],
            ),
        )

    def parse(self, response):
        yield {"url": response.url}  # response.url is "https://www.iana.org/domains/reserved"

Scroll down on an infinite scroll page, take a screenshot of the full page

import pyppeteer
import scrapy
from scrapy import Request
from scrapy_pyppeteer.page import PageCoroutine


class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                pyppeteer=True,
                pyppeteer_page_coroutines=[
                    PageCoroutine("waitForSelector", "div.quote"),
                    PageCoroutine("evaluate", "window.scrollBy(0, 2000)"),
                    PageCoroutine("waitForSelector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response, page: pyppeteer.page.Page):
        await page.screenshot(options={"path": "quotes.png", "fullPage": True})
        yield {"quote_count": len(response.css("div.quote"))}  # 100 quotes

Acknowledgements

This project was inspired by:


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scrapy-pyppeteer-0.0.4.tar.gz (7.5 kB, source)

Built Distribution

scrapy_pyppeteer-0.0.4-py3-none-any.whl (7.7 kB, Python 3)

File details

Details for the file scrapy-pyppeteer-0.0.4.tar.gz.

File metadata

  • Download URL: scrapy-pyppeteer-0.0.4.tar.gz
  • Upload date:
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for scrapy-pyppeteer-0.0.4.tar.gz:

  • SHA256: 13ea7be6f209ca4d099a86df7e7baa3bb3926409a75758a8a3327079d6786486
  • MD5: c38aef8c2dfc2ef9f9d2469445859fca
  • BLAKE2b-256: c909a30aa947131da57cde0b1bec58febf11e379092a893408d5af2fb0822d19

See more details on using hashes here.

File details

Details for the file scrapy_pyppeteer-0.0.4-py3-none-any.whl.

File metadata

  • Download URL: scrapy_pyppeteer-0.0.4-py3-none-any.whl
  • Upload date:
  • Size: 7.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for scrapy_pyppeteer-0.0.4-py3-none-any.whl:

  • SHA256: 030dbf20187b836e0468d072450d000a3ce1584b50edfba1b4e549b7840b114d
  • MD5: 07f676a4c633fe2a011c6cf29cc65252
  • BLAKE2b-256: e77a235ec77450d1feb5efbeb5ebe9938db3eff33c3757f3b6ce2ce740a0a4c1

See more details on using hashes here.
