scrapy-playwright: Playwright integration for Scrapy

A Scrapy Download Handler which performs requests using Playwright for Python. It can be used to handle pages that require JavaScript (among other things), while adhering to the regular Scrapy workflow (i.e. without interfering with request scheduling, item processing, etc).

Requirements

Since version 2.0, which added support for coroutine syntax and asyncio, Scrapy can integrate asyncio-based projects such as Playwright.

Minimum required versions

  • Python >= 3.7
  • Scrapy >= 2.0 (!= 2.4.0)
  • Playwright >= 1.15

Installation

scrapy-playwright is available on PyPI and can be installed with pip:

pip install scrapy-playwright

playwright is defined as a dependency, so it is installed automatically. However, it might be necessary to install the specific browser(s) that will be used:

playwright install

It's also possible to install only a subset of the available browsers:

playwright install firefox chromium

Changelog

Please see the changelog.md file.

Activation

Replace the default http and/or https Download Handlers through DOWNLOAD_HANDLERS:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler. Unless explicitly marked (see Basic usage), requests will be processed by the regular Scrapy download handler.

Also, be sure to install the asyncio-based Twisted reactor:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Supported settings

  • PLAYWRIGHT_BROWSER_TYPE (type str, default chromium) The browser type to be launched, e.g. chromium, firefox, webkit.

  • PLAYWRIGHT_LAUNCH_OPTIONS (type dict, default {})

    A dictionary with options to be passed when launching the Browser. See the docs for BrowserType.launch.

  • PLAYWRIGHT_CONTEXTS (type dict[str, dict], default {})

    A dictionary which defines Browser contexts to be created on startup. It should be a mapping of (name, keyword arguments). For instance:

    {
        "foobar": {
            "context_arg1": "value",
            "context_arg2": "value",
        },
        "default": {
            "context_arg1": "value",
            "context_arg2": "value",
        },
        "persistent": {
            "user_data_dir": "/path/to/dir",  # will be a persistent context
            "context_arg1": "value",
        },
    }
    

    See the section on browser contexts for more information.

    See also the docs for Browser.new_context.

  • PLAYWRIGHT_MAX_CONTEXTS (type Optional[int], default None)

    Maximum amount of allowed concurrent Playwright contexts. If unset or None, no limit is enforced. See the Maximum concurrent context count section for more information.

  • PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT (type Optional[float], default None)

    The timeout used when requesting pages by Playwright. If None or unset, the default value will be used (30000 ms at the time of writing this). See the docs for BrowserContext.set_default_navigation_timeout.

  • PLAYWRIGHT_PROCESS_REQUEST_HEADERS (type Optional[Union[Callable, str]], default scrapy_playwright.headers.use_scrapy_headers)

    A function (or the path to a function) that processes headers for a given request and returns a dictionary with the headers to be used (note that, depending on the browser, additional default headers could be sent as well). Coroutine functions (async def) are supported.

    This will be called at least once for each Scrapy request (receiving said request and the corresponding Playwright request), but it could be called additional times if the given resource generates more requests (e.g. to retrieve assets like images or scripts).

    The function must return a dict object, and receives the following keyword arguments:

    - browser_type: str
    - playwright_request: playwright.async_api.Request
    - scrapy_headers: scrapy.http.headers.Headers
    

    The default value (scrapy_playwright.headers.use_scrapy_headers) tries to emulate Scrapy's behaviour for navigation requests, i.e. overriding headers with their values from the Scrapy request. For non-navigation requests (e.g. images, stylesheets, scripts, etc), only the User-Agent header is overridden, for consistency.

    Setting PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None will give complete control of the headers to Playwright, i.e. headers from Scrapy requests will be ignored and only headers set by Playwright will be sent. When doing this, please keep in mind that headers passed via the Request.headers attribute or set by Scrapy components are ignored (including cookies set via the Request.cookies attribute).

  • PLAYWRIGHT_MAX_PAGES_PER_CONTEXT (type int, defaults to the value of Scrapy's CONCURRENT_REQUESTS setting)

    Maximum amount of allowed concurrent Playwright pages for each context. See the notes about leaving unclosed pages.

  • PLAYWRIGHT_ABORT_REQUEST (type Optional[Union[Callable, str]], default None)

    A predicate function (or the path to a function) that receives a playwright.async_api.Request object and must return True if the request should be aborted, False otherwise. Coroutine functions (async def) are supported.

    For instance, the following are all equivalent, and prevent the download of images:

    PLAYWRIGHT_ABORT_REQUEST = lambda req: req.resource_type == "image"
    
    def should_abort_request(req):
        return req.resource_type == "image"
    
    PLAYWRIGHT_ABORT_REQUEST = should_abort_request
    
    # project/utils.py
    def should_abort_request(req):
        return req.resource_type == "image"
    
    # settings.py
    PLAYWRIGHT_ABORT_REQUEST = "project.utils.should_abort_request"
    

    Please note that all requests will appear in the DEBUG level logs; however, there will be no corresponding response log lines for aborted requests. Aborted requests are counted in the playwright/request_count/aborted job stats item.

    General note about settings

    For the settings which accept object paths as strings, passing callable objects is only supported when using Scrapy>=2.4. With prior versions, only strings are supported.
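As an illustration of PLAYWRIGHT_PROCESS_REQUEST_HEADERS, the sketch below ignores the headers coming from Scrapy and picks a User-Agent based on the browser type. The function name custom_headers and the header values are hypothetical; only the three keyword arguments come from the description above.

```python
def custom_headers(*, browser_type, playwright_request, scrapy_headers):
    """Return the headers to send, ignoring the ones set by Scrapy."""
    # Hypothetical values: replace them with real User-Agent strings.
    if browser_type == "firefox":
        return {"User-Agent": "example-firefox-agent"}
    return {"User-Agent": "example-default-agent"}


# settings.py
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
```

With Scrapy versions prior to 2.4, the setting would have to hold the path to the function as a string instead of the function object, as explained in the general note above.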

Supported request meta keys

  • playwright (type bool, default False)

    If set to a value that evaluates to True the request will be processed by Playwright.

  • playwright_context (type str, default "default")

    Name of the context to be used to download the request. See the section on browser contexts for more information.

  • playwright_context_kwargs (type dict, default {})

    A dictionary with keyword arguments to be used when creating a new context, if a context with the name specified in the playwright_context meta key does not exist already. See the section on browser contexts for more information.

  • playwright_include_page (type bool, default False)

    If True, the Playwright page that was used to download the request will be available in the callback via response.meta['playwright_page'].

    Important!

    This meta key is entirely optional: it's NOT necessary for the page to load or for any asynchronous operation to be performed (specifically, it's NOT necessary for PageMethod objects to be applied). Use it only if you need access to the Page object in the callback that handles the request.

    For more information and important notes see Receiving Page objects in callbacks.

  • playwright_page_event_handlers (type Optional[Dict[str, Callable]], default None)

    A dictionary of handlers to be attached to page events. See Handling page events.

  • playwright_page_init_callback (type Optional[Union[Callable, str]], default None)

    A coroutine function (async def) to be invoked immediately after creating a page for the request. It receives the page and the request as positional arguments. Useful for initialization code. Invoked only for newly created pages, ignored if the page for the request already exists (e.g. by passing playwright_page).

    async def init_page(page, request):
        await page.add_init_script(path="./custom_script.js")
    
    class AwesomeSpider(scrapy.Spider):
        def start_requests(self):
            yield scrapy.Request(
                url="https://httpbin.org/headers",
                meta={
                    "playwright": True,
                    "playwright_page_init_callback": init_page,
                },
            )
    

    Important!

    scrapy-playwright uses Page.route & Page.unroute internally. Please avoid using these methods unless you know exactly what you're doing.

  • playwright_page_methods (type Iterable, default ())

    An iterable of scrapy_playwright.page.PageMethod objects to indicate actions to be performed on the page before returning the final response. For more information see Executing actions on pages.

  • playwright_page (type Optional[playwright.async_api._generated.Page], default None)

    A Playwright page to be used to download the request. If unspecified, a new page is created for each request. This key could be used in conjunction with playwright_include_page to make a chain of requests using the same page. For instance:

    def start_requests(self):
        yield scrapy.Request(
            url="https://httpbin.org/get",
            meta={"playwright": True, "playwright_include_page": True},
        )
    
    def parse(self, response):
        page = response.meta["playwright_page"]
        yield scrapy.Request(
            url="https://httpbin.org/headers",
            callback=self.parse_headers,
            meta={"playwright": True, "playwright_page": page},
        )
    
  • playwright_page_goto_kwargs (type dict, default {})

    A dictionary with keyword arguments to be passed to the page's goto method when navigating to a URL. The url key is ignored if present; the request's URL is used instead.

  • playwright_security_details (type Optional[dict], read only)

    A dictionary with security information about the given response. Only available for HTTPS requests. It can be accessed in the callback via response.meta['playwright_security_details'].
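The meta keys above are plain dictionary entries, so they can be assembled by a small helper. The sketch below builds a meta dict that sets playwright and playwright_page_goto_kwargs. The helper name playwright_meta and its default values are hypothetical; wait_until and timeout are keyword arguments of Playwright's Page.goto.

```python
def playwright_meta(wait_until="domcontentloaded", timeout_ms=15000, **extra):
    """Build a Request.meta dict for a request handled by Playwright."""
    meta = {
        "playwright": True,
        # Passed to Page.goto; a "url" key here would be ignored.
        "playwright_page_goto_kwargs": {
            "wait_until": wait_until,
            "timeout": timeout_ms,  # milliseconds
        },
    }
    # Any other supported meta key, e.g. playwright_context="foobar".
    meta.update(extra)
    return meta
```

The result would then be passed as scrapy.Request(url, meta=playwright_meta()).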

Basic usage

Set the playwright Request.meta key to download a request using Playwright:

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "awesome"

    def start_requests(self):
        # GET request
        yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})
        # POST request
        yield scrapy.FormRequest(
            url="https://httpbin.org/post",
            formdata={"foo": "bar"},
            meta={"playwright": True},
        )

    def parse(self, response):
        # 'response' contains the page as seen by the browser
        yield {"url": response.url}

Notes about the User-Agent header

By default, outgoing requests include the User-Agent set by Scrapy (either with the USER_AGENT or DEFAULT_REQUEST_HEADERS settings or via the Request.headers attribute). This could cause some sites to react in unexpected ways, for instance if the user agent does not match the running Browser. If you prefer the User-Agent sent by default by the specific browser you're using, set the Scrapy user agent to None.
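In a project's settings.py this could look as follows. This is a minimal sketch, assuming that no other component (for instance DEFAULT_REQUEST_HEADERS or a custom middleware) sets the User-Agent header.

```python
# settings.py
# Let the browser send its own default User-Agent.
USER_AGENT = None
```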

Receiving Page objects in callbacks

Specifying a non-False value for the playwright_include_page meta key for a request will result in the corresponding playwright.async_api.Page object being available in the playwright_page meta key in the request callback. In order to be able to await coroutines on the provided Page object, the callback needs to be defined as a coroutine function (async def).

Caution

Use this carefully, and only if you really need to do things with the Page object in the callback. If pages are not properly closed after they are no longer necessary the spider job could get stuck because of the limit set by the PLAYWRIGHT_MAX_PAGES_PER_CONTEXT setting.

import scrapy

class AwesomeSpiderWithPage(scrapy.Spider):
    name = "page_spider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            callback=self.parse_first,
            meta={"playwright": True, "playwright_include_page": True},
            errback=self.errback_close_page,
        )

    def parse_first(self, response):
        page = response.meta["playwright_page"]
        return scrapy.Request(
            url="https://example.com",
            callback=self.parse_second,
            meta={"playwright": True, "playwright_include_page": True, "playwright_page": page},
            errback=self.errback_close_page,
        )

    async def parse_second(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()  # "Example Domain"
        await page.close()
        return {"title": title}

    async def errback_close_page(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

Notes:

  • When passing playwright_include_page=True, make sure pages are always closed when they are no longer used. It's recommended to set a Request errback to make sure pages are closed even if a request fails (if playwright_include_page=False or unset, pages are automatically closed upon encountering an exception). This is important, as open pages count towards the limit set by PLAYWRIGHT_MAX_PAGES_PER_CONTEXT and crawls could freeze if the limit is reached and pages remain open indefinitely.
  • Defining callbacks as async def is only necessary if you need to await things. It's NOT necessary if you just need to pass the Page object from one callback to another (see the example above).
  • Any network operations resulting from awaiting a coroutine on a Page object (goto, go_back, etc) will be executed directly by Playwright, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc).
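One way to make sure a page is closed even if an awaited call fails inside the callback is a try/finally block. The helper below is a minimal sketch; the name page_title_then_close is hypothetical, and it only relies on the title and close coroutines of the Page object.

```python
async def page_title_then_close(page):
    """Read the page title and always close the page afterwards."""
    try:
        return await page.title()
    finally:
        # Runs on success and on error, so the page does not stay open
        # and keep counting towards PLAYWRIGHT_MAX_PAGES_PER_CONTEXT.
        await page.close()
```

Inside an async def callback it would be used as title = await page_title_then_close(response.meta["playwright_page"]). An errback is still needed for requests that fail before reaching the callback.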

Browser contexts

Multiple browser contexts to be launched at startup can be defined via the PLAYWRIGHT_CONTEXTS setting.

Choosing a specific context for a request

Pass the name of the desired context in the playwright_context meta key:

yield scrapy.Request(
    url="https://example.org",
    meta={"playwright": True, "playwright_context": "first"},
)

Default context

If a request does not explicitly indicate a context via the playwright_context meta key, it falls back to using a general context called default. This default context can also be customized on startup via the PLAYWRIGHT_CONTEXTS setting.

Persistent contexts

Pass a value for the user_data_dir keyword argument to launch a context as persistent (see BrowserType.launch_persistent_context).
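For illustration, a PLAYWRIGHT_CONTEXTS setting that customizes the default context and also defines a persistent one might look like the sketch below. The context name profile, the viewport size and the directory path are hypothetical; viewport is a keyword argument of Browser.new_context.

```python
# settings.py
PLAYWRIGHT_CONTEXTS = {
    # Used by requests that do not set the playwright_context meta key.
    "default": {
        "viewport": {"width": 1280, "height": 720},
    },
    # user_data_dir makes this context persistent (hypothetical path).
    "profile": {
        "user_data_dir": "/path/to/dir",
    },
}
```

A request would select the persistent context with meta={"playwright": True, "playwright_context": "profile"}.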

Creating a context during a crawl

If the context specified in the playwright_context meta key does not exist, it will be created. You can specify keyword arguments to be passed to Browser.new_context in the playwright_context_kwargs meta key:

yield scrapy.Request(
    url="https://example.org",
    meta={
        "playwright": True,
        "playwright_context": "new",
        "playwright_context_kwargs": {
            "java_script_enabled": False,
            "ignore_https_errors": True,
            "proxy": {
                "server": "http://myproxy.com:3128",
                "username": "user",
                "password": "pass",
            },
        },
    },
)

Please note that if a context with the specified name already exists, that context is used and playwright_context_kwargs are ignored.

Closing a context during a crawl

After receiving the Page object in your callback, you can access a context through the corresponding Page.context attribute, and await close on it.

def parse(self, response):
    yield scrapy.Request(
        url="https://example.org",
        callback=self.parse_in_new_context,
        errback=self.close_context_on_error,
        meta={"playwright": True, "playwright_context": "new", "playwright_include_page": True},
    )

async def parse_in_new_context(self, response):
    page = response.meta["playwright_page"]
    title = await page.title()
    await page.context.close()
    return {"title": title}

async def close_context_on_error(self, failure):
    self.logger.warning("There was an error when processing %s: %s", failure.request, failure.value)
    page = failure.request.meta["playwright_page"]
    await page.context.close()

Maximum concurrent context count

Specify a value for the PLAYWRIGHT_MAX_CONTEXTS setting to limit the amount of concurrent contexts. This setting should be used with caution: it's possible to block the whole crawl if contexts are not closed after they are no longer used (refer to the above section to dynamically close contexts). Make sure to define an errback to still be able to close the context even if there are errors with a request.

Proxy support

Proxies are supported at the Browser level by specifying the proxy key in the PLAYWRIGHT_LAUNCH_OPTIONS setting:

from scrapy import Spider, Request

class ProxySpider(Spider):
    name = "proxy"
    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://myproxy.com:3128",
                "username": "user",
                "password": "pass",
            },
        }
    }

    def start_requests(self):
        yield Request("http://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print(response.text)

You can also set proxies per context with the PLAYWRIGHT_CONTEXTS setting:

PLAYWRIGHT_CONTEXTS = {
    "default": {
        "proxy": {
            "server": "http://default-proxy.com:3128",
            "username": "user1",
            "password": "pass1",
        },
    },
    "alternative": {
        "proxy": {
            "server": "http://alternative-proxy.com:3128",
            "username": "user2",
            "password": "pass2",
        },
    },
}

Alternatively, pass a proxy key in playwright_context_kwargs when creating a context during a crawl.
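A proxy for a context created during a crawl goes through the playwright_context_kwargs meta key. The helper below is a sketch of how such a meta dict could be built; the name proxied_meta and its parameters are hypothetical.

```python
def proxied_meta(context_name, server, username=None, password=None):
    """Build Request.meta for a Playwright context that uses a proxy."""
    proxy = {"server": server}
    # Credentials are optional; add them only when given.
    if username is not None:
        proxy["username"] = username
    if password is not None:
        proxy["password"] = password
    return {
        "playwright": True,
        "playwright_context": context_name,
        # Only applied if no context with this name exists yet.
        "playwright_context_kwargs": {"proxy": proxy},
    }
```

Since playwright_context_kwargs are ignored for contexts that already exist, each distinct proxy would need its own context name.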


Executing actions on pages

An ordered iterable (list, tuple or dict, for instance) can be passed in the playwright_page_methods Request.meta key to request coroutines to be awaited on the Page before returning the final Response to the callback.

This is useful when you need to perform certain actions on a page, like scrolling down or clicking links, and you want to handle only the final result in your callback.

PageMethod class

scrapy_playwright.page.PageMethod(method: str, *args, **kwargs):

Represents a method to be called (and awaited if necessary) on a playwright.async_api.Page object, such as "click", "screenshot", "evaluate", etc. method is the name of the method, and *args and **kwargs are passed when calling that method. The return value will be stored in the PageMethod.result attribute.

For instance:

from scrapy import Request
from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={
            "playwright": True,
            "playwright_page_methods": [
                PageMethod("screenshot", path="example.png", full_page=True),
            ],
        },
    )

def parse(self, response):
    screenshot = response.meta["playwright_page_methods"][0]
    # screenshot.result contains the image's bytes

produces the same effect as:

def start_requests(self):
    yield Request(
        url="https://example.org",
        meta={"playwright": True, "playwright_include_page": True},
    )

async def parse(self, response):
    page = response.meta["playwright_page"]
    screenshot = await page.screenshot(path="example.png", full_page=True)
    # screenshot contains the image's bytes
    await page.close()

Supported methods

Please refer to the upstream docs for the Page class to see available methods.

Impact on Response objects

Certain Response attributes (e.g. url, ip_address) reflect the state after the last action performed on a page. If you issue a PageMethod with an action that results in a navigation (e.g. a click on a link), the Response.url attribute will point to the new URL, which might be different from the request's URL.
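A callback can detect this situation by comparing the response URL with the URL of the request that produced it. The helper below is a rough sketch; the names navigated_away and _normalized are hypothetical, and the normalization (empty path treated as "/", fragment dropped) is an assumption about which differences should not count as a navigation.

```python
from urllib.parse import urlsplit


def _normalized(url):
    """Split a URL, treating an empty path as "/" and dropping the fragment."""
    parts = urlsplit(url)
    return parts._replace(path=parts.path or "/", fragment="")


def navigated_away(response):
    """True if the response URL differs from its request's URL."""
    return _normalized(response.url) != _normalized(response.request.url)
```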

Handling page events

A dictionary of Page event handlers can be specified in the playwright_page_event_handlers Request.meta key. Keys are the name of the event to be handled (dialog, download, etc). Values can be either callables or strings (in which case a spider method with the name will be looked up).

Example:

import logging

import scrapy
from playwright.async_api import Dialog, Response as PlaywrightResponse

async def handle_dialog(dialog: Dialog) -> None:
    logging.info(f"Handled dialog with message: {dialog.message}")
    await dialog.dismiss()

class EventSpider(scrapy.Spider):
    name = "event"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                playwright=True,
                playwright_page_event_handlers={
                    "dialog": handle_dialog,
                    "response": "handle_response",
                },
            ),
        )

    async def handle_response(self, response: PlaywrightResponse) -> None:
        logging.info(f"Received response with URL {response.url}")

See the upstream Page docs for a list of the accepted events and the arguments passed to their handlers.

Note: keep in mind that, unless they are removed later, these handlers will remain attached to the page and will be called for subsequent downloads using the same page. This is usually not a problem, since by default requests are performed in single-use pages.

Examples

Click on a link, save the resulting page as PDF

import scrapy
from scrapy_playwright.page import PageMethod

class ClickAndSavePdfSpider(scrapy.Spider):
    name = "pdf"

    def start_requests(self):
        yield scrapy.Request(
            url="https://example.org",
            meta=dict(
                playwright=True,
                playwright_page_methods={
                    "click": PageMethod("click", selector="a"),
                    "pdf": PageMethod("pdf", path="/tmp/file.pdf"),
                },
            ),
        )

    def parse(self, response):
        pdf_bytes = response.meta["playwright_page_methods"]["pdf"].result
        with open("iana.pdf", "wb") as fp:
            fp.write(pdf_bytes)
        yield {"url": response.url}  # response.url is "https://www.iana.org/domains/reserved"

Scroll down on an infinite scroll page, take a screenshot of the full page

import scrapy
from scrapy_playwright.page import PageMethod

class ScrollSpider(scrapy.Spider):
    name = "scroll"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/scroll",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.quote"),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="quotes.png", full_page=True)
        await page.close()
        return {"quote_count": len(response.css("div.quote"))}  # quotes from several pages

For more examples, please see the scripts in the examples directory.

Known issues

Lack of native support for Windows

This package does not work natively on Windows. This is because:

  • Playwright runs the driver in a subprocess. Source: Playwright repository.
  • "On Windows, the default event loop ProactorEventLoop supports subprocesses, whereas SelectorEventLoop does not". Source: Python docs.
  • Twisted's asyncio reactor requires the SelectorEventLoop. Source: Twisted repository.

Some users have reported having success running under WSL. See also #78 for information about working in headful mode under WSL.

No per-request proxy support

Specifying a proxy via the proxy Request meta key is not supported. Refer to the Proxy support section for more information.

Unsupported signals

The headers_received and bytes_received signals are not fired by the scrapy-playwright download handler.

Deprecation policy

Deprecated features will be supported for at least six months following the release that deprecated them. After that, they may be removed at any time. See the changelog for more information about deprecations and removals.
