A Scrapy middleware to use with autologin
This is a Scrapy middleware that uses the autologin HTTP API to maintain a logged-in state for a Scrapy spider.
Install with pip:
pip install autologin-middleware
Add the autologin middleware to the project settings and specify the autologin url:
AUTOLOGIN_URL = 'http://127.0.0.1:8089'
AUTOLOGIN_ENABLED = True
DOWNLOADER_MIDDLEWARES['autologin_middleware.AutologinMiddleware'] = 605
Cookie support is also required. There are currently several options:
scrapy cookie middleware (COOKIES_ENABLED = True). Autologin middleware requires access to cookies, so you need to replace the default cookie middleware with a custom one that exposes them:
DOWNLOADER_MIDDLEWARES = {
    'autologin_middleware.AutologinMiddleware': 605,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'autologin_middleware.ExposeCookiesMiddleware': 700,
}
scrapy-splash cookie middleware (scrapy_splash.SplashCookiesMiddleware)
any other middleware that gets cookies from request.cookies and sets response.cookiejar like scrapy-splash middleware, or exposes them in response.flags like ExposeCookiesMiddleware.
Autologin middleware uses autologin to keep all requests logged in: it uses autologin to get cookies, detects logouts, and tries to avoid them in the future. A single authorization domain for the spider is assumed. Autologin middleware also puts autologin_active into request.meta; it is True only if we are logged in, and False if the domain is skipped or login failed. If requests are made via splash (and SPLASH_URL is set), autologin middleware passes SPLASH_URL to autologin, and this splash instance is also used to obtain login cookies.
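As a minimal sketch of how a spider callback might use the autologin_active flag, the snippet below skips pages that were fetched while not logged in. The names is_logged_in and parse_private_page are hypothetical helpers, not part of autologin-middleware; the sketch only assumes that the response object exposes meta (in Scrapy, response.meta is a shortcut for response.request.meta).

```python
def is_logged_in(response):
    """Return True only if the middleware marked this request as logged in."""
    # autologin_active is set by the autologin middleware in request.meta;
    # a missing key is treated the same as False.
    return bool(response.meta.get('autologin_active'))


def parse_private_page(response):
    """Hypothetical callback body: yield an item only for logged-in pages."""
    if not is_logged_in(response):
        # Domain was skipped or login failed: nothing private to extract.
        return
    yield {'url': response.url}
```

In a real spider the same check would sit at the top of a parse method, so content that is only visible to authenticated users is not scraped from anonymous pages.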
There are some optional settings:
AUTOLOGIN_COOKIES: pass auth cookies after manual login (format is name=value; name2=value2).
AUTOLOGIN_LOGOUT_URL: pass url substring to avoid.
AUTOLOGIN_CHECK_LOGOUT: set to False to disable automatic logout detection. By default, the middleware remembers cookies obtained during login and checks on each response whether any of them disappeared. This can be problematic for sites that set a lot of cookies on login, hence the option to disable it. If you disable it, you must rely on avoiding logout links with link_looks_like_logout (see below), or on setting a custom AUTOLOGIN_LOGOUT_URL.
AUTOLOGIN_USERNAME, AUTOLOGIN_PASSWORD, AUTOLOGIN_LOGIN_URL, AUTOLOGIN_EXTRA_JS are passed to autologin and override values from stored credentials. AUTOLOGIN_LOGIN_URL is a relative url, and can be omitted if it is the same as the start url. AUTOLOGIN_EXTRA_JS is required only if you want to use the extra_js feature of autologin.
Autologin middleware passes the following settings to the autologin: SPLASH_URL, USER_AGENT, HTTP_PROXY, HTTPS_PROXY, so they are used for autologin requests.
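The sketch below illustrates two of the optional settings described above. format_autologin_cookies builds a string in the name=value; name2=value2 format expected by AUTOLOGIN_COOKIES, and missing_login_cookies shows the general idea behind logout detection (cookies seen at login that are absent from a later response). Both function names and the cookie values are hypothetical, and the second function is an illustration of the described behaviour, not the middleware's actual code.

```python
def format_autologin_cookies(cookies):
    """Join a {name: value} mapping into 'name=value; name2=value2'."""
    return '; '.join(
        '{}={}'.format(name, value) for name, value in cookies.items())


def missing_login_cookies(login_cookie_names, response_cookie_names):
    """Return the names of login cookies that are absent from a response.

    A non-empty result is the kind of signal that suggests a logout.
    """
    return set(login_cookie_names) - set(response_cookie_names)


# settings.py fragment: cookies copied from a browser after a manual login
# (the names and values here are made up for illustration).
AUTOLOGIN_COOKIES = format_autologin_cookies(
    {'sessionid': 'abc123', 'csrftoken': 'xyz'})
```

The second function also shows why sites that set many short-lived cookies on login can trigger false positives: any one of them disappearing makes the result non-empty, which is the case AUTOLOGIN_CHECK_LOGOUT = False is meant for.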
There is also a utility, autologin_middleware.link_looks_like_logout, for checking whether a link looks like a logout link: you can use it in the spider to avoid logout links. Logouts are handled by the autologin middleware anyway, but avoiding logout links can be beneficial for two reasons:
no time is wasted retrying requests that were logged out
in some cases, logout urls can be unique, and the spider will be logging out continuously (for example, /logout?sid=UNIQUE_ID).
Check tests.utils.TestSpider for an example of a minimal spider that uses link_looks_like_logout, and an example of project settings.
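For illustration, here is a minimal sketch of filtering extracted links before following them. To keep it runnable without Scrapy installed, it uses a stand-in Link tuple and a stand-in predicate, stub_looks_like_logout; both are hypothetical. In a real spider you would pass autologin_middleware.link_looks_like_logout as the predicate, assuming it accepts the link objects produced by your link extractor.

```python
from collections import namedtuple

# Stand-in for a Scrapy link object, so the sketch has no dependencies.
Link = namedtuple('Link', ['url', 'text'])


def stub_looks_like_logout(link):
    """Hypothetical stand-in predicate, much cruder than the real utility."""
    return 'logout' in link.url.lower() or 'log out' in link.text.lower()


def links_to_follow(links, looks_like_logout):
    """Keep only the links that the predicate does not flag as logout links."""
    return [link for link in links if not looks_like_logout(link)]


links = [
    Link('http://example.com/profile', 'My profile'),
    Link('http://example.com/logout?sid=42', 'Sign off'),
    Link('http://example.com/bye', 'Log out'),
]
safe_links = links_to_follow(links, stub_looks_like_logout)
```

Passing the predicate as an argument keeps the filtering step easy to test and makes it straightforward to swap in the real utility or a site-specific check.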
License is MIT.