Playwright integration for Scrapy
This project provides a Scrapy Download Handler which performs requests using Playwright for Python. It can be used to handle pages that require JavaScript. This package does not interfere with regular Scrapy workflows such as request scheduling or item processing.
Motivation
After the release of version 2.0, which includes partial coroutine syntax support and experimental asyncio support, Scrapy allows the integration of asyncio-based projects such as Playwright.
Requirements
- Python >= 3.7
- Scrapy >= 2.0 (!= 2.4.0)
- Playwright >= 1.8.0a1
Installation
$ pip install scrapy-playwright
Changelog
Please see the changelog.md file.
Configuration
Replace the default http and https Download Handlers through DOWNLOAD_HANDLERS:

    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }

Note that the ScrapyPlaywrightDownloadHandler class inherits from the default http/https handler, and it will only use Playwright for requests that are explicitly marked (see the "Basic usage" section for details).
Also, be sure to install the asyncio-based Twisted reactor:

    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

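As a quick sanity check on the configuration above, the following sketch verifies that a settings mapping points both schemes at the Playwright handler and selects the asyncio reactor. The helper name `missing_playwright_settings` is hypothetical and not part of scrapy-playwright; only the setting names and values come from this README.

```python
from typing import Dict, List

HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def missing_playwright_settings(settings: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the config looks complete.

    Hypothetical helper for illustration, not a scrapy-playwright API.
    """
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS") or {}
    for scheme in ("http", "https"):
        if handlers.get(scheme) != HANDLER:
            problems.append(f"DOWNLOAD_HANDLERS[{scheme!r}] is not the Playwright handler")
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio-based reactor")
    return problems


example_settings = {
    "DOWNLOAD_HANDLERS": {"http": HANDLER, "https": HANDLER},
    "TWISTED_REACTOR": REACTOR,
}
print(missing_playwright_settings(example_settings))  # []
```

A check like this could run in a test suite, so that a missing reactor setting is caught before a crawl starts.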
Settings
scrapy-playwright accepts the following settings:
- PLAYWRIGHT_BROWSER_TYPE (type str, default chromium)
  The browser type to be launched. Valid values are chromium, firefox and webkit.
- PLAYWRIGHT_LAUNCH_OPTIONS (type dict, default {})
  A dictionary with options to be passed when launching the Browser. See the docs for BrowserType.launch.
- PLAYWRIGHT_CONTEXT_ARGS (type dict, default {})
  A dictionary with default keyword arguments to be passed when creating the "default" Browser context.
  Deprecated: use PLAYWRIGHT_CONTEXTS instead.
- PLAYWRIGHT_CONTEXTS (type dict[str, dict], default {})
  A dictionary which defines Browser contexts to be created on startup. It should be a mapping of (name, keyword arguments). For instance:

      {
          "first": {
              "context_arg1": "value",
              "context_arg2": "value",
          },
          "second": {
              "context_arg1": "value",
          },
      }

  If no contexts are defined, a default context (called default) is created. The arguments passed here take precedence over the ones defined in PLAYWRIGHT_CONTEXT_ARGS. See the docs for Browser.new_context.
- PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT (type Optional[int], default None)
  The timeout used when requesting pages by Playwright. If None or unset, the default value will be used (30000 ms at the time of writing this). See the docs for BrowserContext.set_default_navigation_timeout.
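To make the settings above concrete, here is a sketch of how they might look together in a project's settings.py. All values are illustrative assumptions: the context names ("desktop", "insecure") are arbitrary, and the keyword arguments shown (headless, viewport, ignore_https_errors) are options of Playwright's BrowserType.launch and Browser.new_context, chosen only as examples.

```python
# settings.py (illustrative values only; adjust to your project)

# One of "chromium", "firefox" or "webkit"
PLAYWRIGHT_BROWSER_TYPE = "firefox"

# Passed as keyword arguments when launching the browser
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Contexts created on startup: a mapping of name -> keyword arguments
PLAYWRIGHT_CONTEXTS = {
    "desktop": {"viewport": {"width": 1280, "height": 720}},
    "insecure": {"ignore_https_errors": True},
}

# Navigation timeout in milliseconds (None or unset keeps Playwright's default)
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 10 * 1000
```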
Basic usage
Set the playwright Request.meta key to download a request using Playwright:

    import scrapy

    class AwesomeSpider(scrapy.Spider):
        name = "awesome"

        def start_requests(self):
            # GET request
            yield scrapy.Request("https://httpbin.org/get", meta={"playwright": True})
            # POST request
            yield scrapy.FormRequest(
                url="https://httpbin.org/post",
                formdata={"foo": "bar"},
                meta={"playwright": True},
            )

        def parse(self, response):
            # 'response' contains the page as seen by the browser
            yield {"url": response.url}

Receiving the Page object in the callback
Specifying a non-False value for the playwright_include_page meta key for a request will result in the corresponding playwright.async_api.Page object being available in the playwright_page meta key in the request callback. In order to be able to await coroutines on the provided Page object, the callback needs to be defined as a coroutine function (async def).

    import scrapy
    import playwright

    class AwesomeSpiderWithPage(scrapy.Spider):
        name = "page"

        def start_requests(self):
            yield scrapy.Request(
                url="https://example.org",
                meta={"playwright": True, "playwright_include_page": True},
            )

        async def parse(self, response):
            page = response.meta["playwright_page"]
            title = await page.title()  # "Example Domain"
            await page.close()
            return {"title": title}

Notes:
- In order to avoid memory issues, it is recommended to manually close the page by awaiting the Page.close coroutine.
- Any network operations resulting from awaiting a coroutine on a Page object (goto, go_back, etc.) will be executed directly by Playwright, bypassing the Scrapy request workflow (Scheduler, Middlewares, etc.).
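One way to follow the first note is to close the page in a finally block, so that it is released even if an awaited call raises. The sketch below is an assumption about how a callback could be structured, not code from this project. `read_title_and_close` and `StubPage` are hypothetical names, and `StubPage` only imitates the two Page coroutines used here so that the example runs without a browser.

```python
import asyncio


async def read_title_and_close(page):
    """Return the page title, closing the page even if reading it fails."""
    try:
        return await page.title()
    finally:
        await page.close()


class StubPage:
    """Hypothetical stand-in for playwright.async_api.Page, for demonstration only."""

    def __init__(self):
        self.closed = False

    async def title(self):
        return "Example Domain"

    async def close(self):
        self.closed = True


stub = StubPage()
title = asyncio.run(read_title_and_close(stub))
print(title, stub.closed)  # Example Domain True
```

In a spider, the same try/finally shape would go inside the async callback, with the page taken from response.meta["playwright_page"].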
Multiple browser contexts
Multiple browser contexts to be launched at startup can be defined via the PLAYWRIGHT_CONTEXTS setting.
Choosing a specific context for a request
Pass the name of the desired context in the playwright_context meta key:

    yield scrapy.Request(
        url="https://example.org",
        meta={"playwright": True, "playwright_context": "first"},
    )

Creating a context during a crawl
If the context specified in the playwright_context meta key does not exist, it will be created. You can specify keyword arguments to be passed to Browser.new_context in the playwright_context_kwargs meta key:

    yield scrapy.Request(
        url="https://example.org",
        meta={
            "playwright": True,
            "playwright_context": "new",
            "playwright_context_kwargs": {
                "java_script_enabled": False,
                "ignore_https_errors": True,
                "proxy": {
                    "server": "http://myproxy.com:3128",
                    "username": "user",
                    "password": "pass",
                },
            },
        },
    )

Please note that if a context with the specified name already exists, that context is used and playwright_context_kwargs are ignored.
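Since the context-related meta keys are plain dictionary entries, they can be assembled by a small helper. The sketch below is only an illustration: `playwright_meta` is a hypothetical function, not part of scrapy-playwright, and it uses only the meta keys documented in this README.

```python
from typing import Any, Dict, Optional


def playwright_meta(
    context: Optional[str] = None,
    context_kwargs: Optional[Dict[str, Any]] = None,
    include_page: bool = False,
) -> Dict[str, Any]:
    """Build a Request.meta dict for a Playwright request (hypothetical helper)."""
    meta: Dict[str, Any] = {"playwright": True}
    if context is not None:
        meta["playwright_context"] = context
    if context_kwargs:
        # Only honored when the named context does not exist yet
        meta["playwright_context_kwargs"] = dict(context_kwargs)
    if include_page:
        meta["playwright_include_page"] = True
    return meta


meta = playwright_meta("new", {"ignore_https_errors": True})
print(meta)
```

The resulting dictionary would be passed as the meta argument of scrapy.Request. Keeping the context name and its keyword arguments in one place makes it easier to notice when two requests reuse a name with different arguments, in which case the second set is ignored.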
Closing a context during a crawl
After receiving the Page object in your callback, you can access a context through the corresponding Page.context attribute, and await close on it.

    def parse(self, response):
        yield scrapy.Request(
            url="https://example.org",
            callback=self.parse_in_new_context,
            meta={"playwright": True, "playwright_context": "new", "playwright_include_page": True},
        )

    async def parse_in_new_context(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        await page.context.close()  # close the context
        await page.close()
        return {"title": title}

Page coroutines
An ordered iterable (list, tuple or dict, for instance) can be passed in the playwright_page_coroutines Request.meta key to request coroutines to be awaited on the Page before returning the final Response to the callback.
This is useful when you need to perform certain actions on a page, like scrolling down or clicking links, and you want everything to count as a single Scrapy Response, containing the final result.
Supported actions
- scrapy_playwright.page.PageCoroutine(method: str, *args, **kwargs):
  Represents a coroutine to be awaited on a playwright.page.Page object, such as "click", "screenshot", "evaluate", etc. method should be the name of the coroutine, *args and **kwargs are passed to the function call.
  The coroutine result will be stored in the PageCoroutine.result attribute.
  For instance, PageCoroutine("screenshot", path="quotes.png", full_page=True) produces the same effect as:

      # 'page' is a playwright.async_api.Page object
      await page.screenshot(path="quotes.png", full_page=True)

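The behaviour described above (look up the named coroutine on the page, call it with the stored arguments, keep the result) can be modelled in a few lines. This is a simplified illustration under stated assumptions, not the handler's actual implementation. `PageCoroutineSketch`, `await_page_coroutines` and `StubPage` are hypothetical names, and `StubPage` replaces a real Playwright page so that the example runs without a browser.

```python
import asyncio


class PageCoroutineSketch:
    """Simplified stand-in for scrapy_playwright.page.PageCoroutine."""

    def __init__(self, method, *args, **kwargs):
        self.method = method
        self.args = args
        self.kwargs = kwargs
        self.result = None


async def await_page_coroutines(page, page_coroutines):
    """Await each requested coroutine on the page, in order, storing its result."""
    if isinstance(page_coroutines, dict):
        # For dicts, the values are the coroutine descriptions
        page_coroutines = page_coroutines.values()
    for pc in page_coroutines:
        bound_method = getattr(page, pc.method)
        pc.result = await bound_method(*pc.args, **pc.kwargs)


class StubPage:
    """Hypothetical page that records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    async def click(self, selector):
        self.calls.append(("click", selector))

    async def evaluate(self, expression):
        self.calls.append(("evaluate", expression))
        return 42


requested = {
    "click": PageCoroutineSketch("click", selector="a"),
    "answer": PageCoroutineSketch("evaluate", "6 * 7"),
}
page = StubPage()
asyncio.run(await_page_coroutines(page, requested))
print(requested["answer"].result)  # 42
print([name for name, _ in page.calls])  # ['click', 'evaluate']
```

Passing a dict rather than a list makes it possible to look up a particular result by key in the callback, which is what the PDF example in the next section relies on.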
Examples
Click on a link, save the resulting page as PDF
    import scrapy
    from scrapy_playwright.page import PageCoroutine

    class ClickAndSavePdfSpider(scrapy.Spider):
        name = "pdf"

        def start_requests(self):
            yield scrapy.Request(
                url="https://example.org",
                meta=dict(
                    playwright=True,
                    playwright_page_coroutines={
                        "click": PageCoroutine("click", selector="a"),
                        "pdf": PageCoroutine("pdf", path="/tmp/file.pdf"),
                    },
                ),
            )

        def parse(self, response):
            # The result of each coroutine is available under the key it was requested with
            pdf_bytes = response.meta["playwright_page_coroutines"]["pdf"].result
            with open("iana.pdf", "wb") as fp:
                fp.write(pdf_bytes)
            yield {"url": response.url}  # response.url is "https://www.iana.org/domains/reserved"

Scroll down on an infinite scroll page, take a screenshot of the full page
    import scrapy
    from scrapy_playwright.page import PageCoroutine

    class ScrollSpider(scrapy.Spider):
        name = "scroll"

        def start_requests(self):
            yield scrapy.Request(
                url="http://quotes.toscrape.com/scroll",
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_coroutines=[
                        PageCoroutine("wait_for_selector", "div.quote"),
                        PageCoroutine("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                        PageCoroutine("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                    ],
                ),
            )

        async def parse(self, response):
            page = response.meta["playwright_page"]
            await page.screenshot(path="quotes.png", full_page=True)
            await page.close()
            return {"quote_count": len(response.css("div.quote"))}  # quotes from several pages

For more examples, please see the scripts in the examples directory.