Programmatic web browsing module with AJAX support for Python
Intro
Spynner is a stateful programmatic web browser module for Python. It is based upon PyQt and WebKit. It supports JavaScript, AJAX, and every other technology that WebKit is able to handle (Flash, SVG, …). Spynner takes advantage of jQuery, a powerful JavaScript library that makes interaction with pages and event simulation really easy.
Using Spynner you can simulate a web browser with no GUI (though a browsing window can be opened for debugging purposes), so it may be used to implement crawlers or acceptance testing tools.
See usage at: https://github.com/makinacorpus/spynner/tree/master/src/spynner/tests/spynner.rst or below, if the section is present.
Credits
Companies
Contributors
Leo Lou <https://github.com/l4u>
Dependencies
Libxml2 / Libxslt libraries and include files for lxml
autopy, which in turn needs the xtst lib & headers on Linux (aka Xtest)
Feedback
Open an Issue to report a bug or request a new feature. Other comments and suggestions can be directly emailed to the authors.
Install
Through regular easy_install / buildout:
easy_install spynner
The bleeding edge version is hosted on GitHub:
git clone https://github.com/makinacorpus/spynner.git
cd spynner
python setup.py install
Running Spynner without X11
Spynner needs an X11 server to run. If you are running it on a server without X11, you must install the virtual Xvfb server. Debian users can use the small wrapper (xvfb-run). If you are not using Debian, you can download it here: http://www.mail-archive.com/debian-x@lists.debian.org/msg69632/x-run
xvfb-run python myscript_using_spynner.py
You can also use tightvnc, which is the solution used by the current maintainer [kiorky].
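A launcher script can decide at run time whether the wrapper is needed. The following is a minimal sketch and not part of Spynner: the helper name `spynner_command` is hypothetical, and it assumes that an unset or empty DISPLAY variable means no X11 server is reachable and that xvfb-run is available on the PATH.

```python
import os
import sys


def spynner_command(script, environ=None):
    """Return the argv list used to start `script`.

    Hypothetical helper: prefix the command with xvfb-run when no X11
    display is advertised through the DISPLAY environment variable.
    """
    environ = os.environ if environ is None else environ
    command = [sys.executable, script]
    if not environ.get("DISPLAY"):
        # No display available: run under the virtual Xvfb server.
        command = ["xvfb-run"] + command
    return command


# On a headless server DISPLAY is usually unset.
headless = spynner_command("myscript_using_spynner.py", environ={})
desktop = spynner_command("myscript_using_spynner.py", environ={"DISPLAY": ":0"})
print(headless)
print(desktop)
```

The returned list can be handed to `subprocess.call` as is.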
Initializing the browser
The basic setup needed to get a browser:
>>> import os
>>> import spynner
>>> import time
>>> from StringIO import StringIO
>>> debug_stream = StringIO()
>>> bp = os.path.dirname(spynner.tests.__file__)
The browser:
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
When all is done:
>>> browser.close()
>>> def run_debug(callback, *args, **kwargs):
...     pos = debug_stream.pos
...     ret = callback(*args, **kwargs)
...     show_debug(pos)
...     return ret
>>> def show_debug(pos=None):
...     if not pos: print debug_stream.getvalue()
...     else:
...         pnow = debug_stream.pos
...         debug_stream.seek(pos)
...         print debug_stream.read()
...         debug_stream.seek(pnow)
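The helpers above rely on the `pos` attribute of the Python 2 `StringIO.StringIO` class. The same idea (remember the stream position, run a call, then show only what was logged since) can be sketched with `io.StringIO`, which exposes `tell()` and `seek()` instead. This is a standalone illustration: `fake_action` is a made-up stand-in for a browser call, and `new_output` is a hypothetical name.

```python
import io

debug_stream = io.StringIO()


def new_output(stream, start):
    """Return what was written to `stream` after offset `start`."""
    end = stream.tell()
    stream.seek(start)
    text = stream.read()
    stream.seek(end)  # restore the position so later writes append
    return text


def run_debug(callback, *args, **kwargs):
    """Call `callback` and return (result, log text emitted during the call)."""
    start = debug_stream.tell()
    ret = callback(*args, **kwargs)
    return ret, new_output(debug_stream, start)


def fake_action(message):
    # Stand-in for a browser method that logs to the debug stream.
    debug_stream.write(u"Run Javascript code: %s\n" % message)
    return True


debug_stream.write(u"earlier noise\n")
ret, log = run_debug(fake_action, u"console.log(1)")
print(log)
```

Returning the log text instead of printing it inside the helper keeps the sketch easy to check in a test.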
Debugging
Spynner uses WebKit, which is somewhat low level, so never hesitate to activate verbose logs. Sometimes you'll want to see what is going on:
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
Or after initialization:
>>> browser.debug_level = spynner.DEBUG
See more examples in the repository: https://github.com/kiorky/spynner/tree/master/examples
Showing spynner window
If you also want to see what the browser is doing, just use:
>>> browser.show()
You can hide the webview with:
>>> browser.hide()
Running Javascript
Simply use:
>>> ret = browser.runjs('console.log("foobar")')
Browsing with spynner
A basic but complicated example: Word Reference has resources whose loading can fail, so we wait for the content to be there.
If the website were reliable, we could simply use:
>>> ret = debug_stream.read()
>>> browser.load(bp+"/html_controls.html")
True
This method raises an exception on timeout; the default 30-second timeout can be customized.
But here, our target can randomly fail. Instead, we will load the page and wait for something to be present in the DOM before continuing. We wait until 'aaa' appears in the HTML, with unlimited tries at 1-second intervals:
>>> def wait_load(br):
...     return 'aaa' in browser.html
Hit the wrong URL and, eck, you are stuck in an endless loop:
>>> browser.load(bp+"html_controls.html", 1, wait_callback=wait_load)
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
<Control-C>
With a wrong URL you stay in an endless loop unless you set the number of tries. With tries set, an exception is raised, but the loop stops:
>>> ret = debug_stream.read()
Traceback (most recent call last):
...
SpynnerTimeout: SPYNNER waitload: Timeout reached: 2 retries for 1s delay.
Enough playing, go to the real target:
>>> ret = browser.load(bp+"/html_controls.html", 1, wait_callback=wait_load)
>>> [a for a in debug_stream.getvalue().splitlines() if 'SPYNNER waitload' in a][-1]
'SPYNNER waitload: The callback found what it was waiting for in its contents!'
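The waiting behaviour shown above (check a condition, sleep, retry, give up after a fixed number of tries) follows a common polling pattern. Here is a minimal standalone sketch of that pattern. It is not Spynner's implementation: `wait_until`, `WaitTimeout` and the fake page dictionary are hypothetical names used only for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is still false after the last try."""


def wait_until(condition, delay=1.0, tries=None, sleep=time.sleep):
    """Poll `condition` until it returns a true value.

    With tries=None the loop is unbounded, like the endless wait shown
    above; with a number it raises WaitTimeout after that many failed
    checks. Returns the number of failed checks before success.
    """
    attempt = 0
    while True:
        if condition():
            return attempt
        attempt += 1
        if tries is not None and attempt >= tries:
            raise WaitTimeout(
                "Timeout reached: %d retries for %ss delay." % (tries, delay))
        sleep(delay)


# Fake page whose content only shows up on the third check.
page = {"html": "", "checks": 0}


def content_ready():
    page["checks"] += 1
    if page["checks"] == 3:
        page["html"] = "<p>aaa</p>"
    return "aaa" in page["html"]


failed_checks = wait_until(content_ready, delay=0.01, tries=5)
print("content found after %d failed checks" % failed_checks)
```

Passing the `sleep` function as a parameter is a design choice of this sketch: it lets a test replace the real delay with a no-op.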
Interact with the controls
See the implementation docstrings or the examples!
You have three levels of control:
WebKit methods (wk_fill_*, wk_click_*), which are the recommended ones to use, and the jQuery-based fill_* and click_* methods.
The classical methods (fill, click_*) are now wrappers to the wk_* methods.
Low-level Qt raw events, which do not work that well at the moment. You can at least move the mouse and send keys, but it requires case-by-case coding.
Setup:
>>> browser.close()
>>> del browser
Using radio inputs
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
>>> ret = browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
Using jquery
>>> browser.load_jquery(True)
>>> browser.radio('#radiomea')
>>> ret = run_debug(browser.runjs, '$("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});')
Run Javascript code: $("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});
Javascript console (:1): radiomea a true
Javascript console (:1): radiomeb b false
Javascript console (:1): radiomec c false
<BLANKLINE>
>>> browser.radio('#radiomeb')
>>> ret = run_debug(browser.runjs, '$("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});')
Run Javascript code: $("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});
Javascript console (:1): radiomea a false
Javascript console (:1): radiomeb b true
Javascript console (:1): radiomec c false
<BLANKLINE>
Using webkit native methods
Under the hood, we use this.evaluateJavaScript('this.value = xxx')
>>> browser.wk_radio('#radiomea')
>>> browser.load_jquery(True)
>>> ret = run_debug(browser.runjs, '$("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});')
Run Javascript code: $("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});
Javascript console (:1): radiomea a true
Javascript console (:1): radiomeb b false
Javascript console (:1): radiomec c false
<BLANKLINE>
Using check inputs
Using webkit native methods
>>> browser.close()
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
>>> ret = browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> ret = browser.load_jquery(True)
Under the hood, we use this.evaluateJavaScript('this.value = xxx')
>>> browser.wk_check('#checkmea')
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
>>> browser.wk_check(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb true
Javascript console (:1): checkmec true
>>> browser.wk_uncheck(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
>>> browser.wk_uncheck(['#checkmea'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea false
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
Using jquery
>>> browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> browser.load_jquery(True)
Under the hood, we use $(sel).attr('checked', 'checked'):
>>> browser.check('#checkmea')
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
>>> browser.check(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb true
Javascript console (:1): checkmec true
>>> browser.uncheck(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
>>> browser.uncheck(['#checkmea'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea false
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
Using select inputs
Using webkit native methods
>>> ret = browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> ret = browser.load_jquery(True)
Under the hood, we use this.evaluateJavaScript('this.value = xxx')
>>> browser.wk_select('#sel', 'aa')
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'aa')
>>> browser.wk_select('#sel', 'bb')
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'bb')
>>> browser.wk_select('#sel', 'dd')
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'dd')
If it is not a multiple select, it takes the last value:
>>> browser.wk_select('#sel', ['aa', 'bb', 'dd'])
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'dd')
If it is a multiple select, it takes all the values:
>>> browser.wk_select('#msel', ['maa', 'mbb', 'mdd'])
>>> ret = run_debug(browser.runjs, '$($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): maaa true
Javascript console (:1): mbbb true
Javascript console (:1): mccc false
Javascript console (:1): mddd true
Using jquery
>>> browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> browser.load_jquery(True)
Under the hood, we use $(sel).attr("selected", "selected"):
>>> browser.select('#sel option[name="bbb"]')
>>> pos = debug_stream.pos
>>> ret = run_debug(browser.runjs, '$($("#sel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#sel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): aaa false
Javascript console (:1): bbb true
Javascript console (:1): ccc false
Javascript console (:1): ddd false
On a select that accepts multiple values, you can also keep the already selected values instead of deselecting them (they are removed by default):
>>> browser.select('#asel option[name="bbb"]', remove=False)
>>> ret = run_debug(browser.runjs, '$($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): aaa false
Javascript console (:1): bbb true
Javascript console (:1): ccc true
Javascript console (:1): ddd false
>>> browser.select('#asel option[name="bbb"]', remove=True)
>>> ret = run_debug(browser.runjs, '$($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): aaa false
Javascript console (:1): bbb true
Javascript console (:1): ccc false
Javascript console (:1): ddd false
If it is a multiple select, it takes all the values:
>>> browser.select(['#msel option[name="mbbb"]', '#msel option[name="mddd"]'])
>>> ret = run_debug(browser.runjs, '$($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): maaa false
Javascript console (:1): mbbb true
Javascript console (:1): mccc false
Javascript console (:1): mddd true
Using text inputs
Using webkit native methods
Under the hood, we use this.evaluateJavaScript('this.value = xxx'):
>>> browser.wk_fill('input[name=w]', 'bar')
Using jquery
Under the hood, we use jQuery(selector).val(xxx):
>>> browser.fill('input[name="w"]', 'foo')
>>> ret = run_debug(browser.fill, 'input[name="w"]', 'foo')
Run Javascript code: $('input[name="w"]').val('foo')
Jquery Notes
Spynner uses jQuery to make the JavaScript interface easier. Two modules can be injected into a loaded page:
jQuery core: amongst other things, it adds the powerful jQuery selectors, which are used internally by some Spynner methods. Of course, you can also use jQuery when you inject your own code into a page.
[OBSOLETE, USE AT YOUR OWN RISK, NOT MAINTAINED, NO BUGFIXES] Simulate jQuery plugin: makes it possible to simulate mouse and keyboard events (for now Spynner uses it only in the _click_ action). Look up the library code to see which kinds of events you can fire.
Nowadays jQuery is already included on most major websites, so it must not be injected when the target website already loads it.
Loading jQuery manually
>>> time.sleep(3)
>>> browser.close()
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
>>> browser.show()
>>> ret = run_debug(browser.runjs,"console.log(typeof(jQuery));")
Run Javascript code: console.log(typeof(jQuery));
Javascript console (:1): undefined
Eck, we did not include jQuery!
Loading it:
>>> ret = browser.load_jquery(force=True)
>>> ret = run_debug(browser.runjs, "console.log(typeof(jQuery));")
Run Javascript code: console.log(typeof(jQuery));
Javascript console (:1): function
Cook your soup: parsing the HTML
You can parse the HTML of a webpage with your favorite parsing library, e.g. BeautifulSoup, lxml, or … Since we are already using jQuery for JavaScript, it feels natural to work with pyquery, its Python counterpart:
>>> import pyquery
>>> ret = browser.load(bp+'/html_controls.html')
>>> d = pyquery.PyQuery(browser.html)
>>> aaa = d.make_links_absolute("http://foo")[0]
>>> [dict(a.items())['href'] for a in d.root.xpath('//a')]
['http://foo/foo', 'http://foo/a/foo', 'http://foo/../b/foo', 'http://foo/c/foo', 'http://foo/d/foo']
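If pyquery is not available, a similar link extraction can be approximated with the standard library alone. The sketch below is written for Python 3 and works on a literal HTML string standing in for `browser.html`; `LinkCollector` is a hypothetical name and the sample markup is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


# Made-up markup standing in for the value of browser.html.
html = '<a href="foo">1</a> <a href="a/foo">2</a> <a href="/c/foo">3</a>'
collector = LinkCollector("http://foo/")
collector.feed(html)
print(collector.links)
```

Note that `urljoin` normalizes `..` segments, so its results can differ from the pyquery output shown above for relative links that climb out of the base path.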
HTTP Headers
You can give a list of HTTP headers to send, either with each request at construction time or via the load methods.
Headers are in the form:
[('User-Agent', 'foobar')]
SSL support
You have two keyword arguments to specify:
a list (see QtSsl) of supported ciphers to use
the protocol to use (sslv2, tlsv1, sslv3)
Mouse
You can move the mouse onto an element matched by a CSS selector:
br.move_mouse('.myclass', [offsetx=0, offsety=0])
Proxy support
Spynner supports all proxies supported by Qt (http(s), socks5 & ftp).
See examples/proxy.py in the examples directory.
Basically, use:
br.set_proxy('foo:3128')
br.set_proxy('http://foo:3128')
br.set_proxy('http://user:suserpassword@foo:3128')
br.set_proxy('https://user:suserpassword@foo:3128')
br.set_proxy('socks5://user:suserpassword@foo:3128')
br.set_proxy('httpcaching://user:suserpassword@foo:3128')
br.set_proxy('ftpcaching://user:suserpassword@foo:3128')
You can also use a proxy in the download method. Note that by default it uses the proxy set via a previous br.set_proxy call:
br.download('http://superfile', proxy_url='foo:3128')
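The URL forms listed above all carry the same parts: an optional scheme, optional credentials, a host and a port. The sketch below splits such a string with the Python 3 standard library to make those parts explicit. It only illustrates the URL layout; how Spynner parses the string internally is not shown here, `split_proxy_url` is a hypothetical name, and defaulting to the http scheme when none is given is an assumption.

```python
from urllib.parse import urlsplit


def split_proxy_url(url, default_scheme="http"):
    """Split a proxy URL into scheme, user, password, host and port."""
    if "://" not in url:
        # Bare "host:port" form: assume the default scheme.
        url = "%s://%s" % (default_scheme, url)
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }


for proxy in ("foo:3128", "socks5://user:suserpassword@foo:3128"):
    print(split_proxy_url(proxy))
```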
CHANGELOG
2.16 (2014-04-25)
fix a bug in download: the reply may not have been fully read on exit
2.15 (2013-07-16)
fix #46
2.14 (2013-06-05)
cookie jar fix (#41)
2.13 (2013-05-17)
Better proxy support
Travis setup
2.12 (2013-05-03)
Cookie jar fix
2.11 (2013-04-23)
fix release again
2.10 (2013-04-22)
fix release
2.9 (2013-04-22)
run natives clicks with autopy
2.8 (2013-04-19)
add a helper to move the mouse more easily
2.7 (2013-04-17)
Better ssl support
better http headers support
pyside support
better cookie support
2.6 (2013-03-07)
fix #17: download timeout
2.5 (2013-03-06)
fix #25: new signal api for sslErrors
2.4 (2012-09-28)
Example google fixed
2.3 (2012-09-28)
documentation
2.2 (2012-09-20)
Fix a bug where jQuery compatibility mode could fail to be activated; thanks to yusumishi (yusumishi@gmail.com) for the report.
2.1 (2012-08-30)
proper release
2.0 (2012-08-05)
Make new defaults for sane initialization & api cleanup, now:
We remapped the simulation functions to the wk_* ones
we added extensive documentation in src/spynner/tests/spynner.rst
we do not embed jquery as default
we do not embed jquery’s simulate plugin automatically, as it is totally deprecated
1.11 (2012-08-04)
proper release
1.10 (2011-06-07)
add wk_check/wk_uncheck methods
1.9 (2011-05-29)
Rework javascript load [kiorky]
Some try in native events [kiorky]
Fix directory issue [kiorky]
add Samples [kiorky]
Fix download cookiesjar free problem [kiorky <kiorky@cryptelium.net>]
Allow download to be tracked for further reuse [kiorky <kiorky@cryptelium.net>]
Generate filenames by looking for their filename in response objects. [kiorky <kiorky@cryptelium.net>]
Add api methods to:
send raw keyboard keys
send qt raw mouse clicks
use qtwebkit native JS click element & fill values
some helpers to wait for content
[kiorky]
Add download files tracker [kiorky]
0.0.3 (2009-08-01)
Click does not wait for page load
Use QtNetwork infrastructure to download files
Expose webkit objects in Browser class
Change jQuery to _jQuery
HTTP authentication
Callbacks for Javascript confirm and prompts
Properties: url, html, soup
Better docstrings (using epydoc)
Implement image snapshots
Implement URL filters
Implement cookies setting [tokland <tokland@gmail.com>]
0.0.2 (2009-07-27)
Use browser.html instead of browser.get_html
Fix setup.py to make it compatible with Win32
Add a URL filter mechanism (with a callback)
Use class-methods instead of burdening Browser.__init__
Instance variable to ignore SSL certificate errors
Start using epydoc format for API documentation
Add create_webview/destroy_webview for GUI debugging [tokland <tokland@gmail.com>]
0.0.1 (2009-07-25)
Initial release. [tokland <tokland@gmail.com>]