Programmatic web browsing module with AJAX support for Python

Intro

Spynner is a stateful programmatic web browser module for Python. It is based upon PyQt and WebKit. It supports JavaScript, AJAX, and every other technology that WebKit is able to handle (Flash, SVG, …). Spynner takes advantage of jQuery, a powerful JavaScript library that makes interaction with pages and event simulation really easy.

Using Spynner, you can simulate a web browser with no GUI (though a browsing window can be opened for debugging purposes), so it can be used to implement crawlers or acceptance testing tools.

Usage is documented at: https://github.com/kiorky/spynner/tree/master/src/spynner/tests/spynner.rst (or below, if the section is present).

Credits

Companies

makinacom

Authors

Contributors

Dependencies

  • Python >= 2.6

  • PyQt > 4.4.3

  • Libxml2 / Libxslt libraries and include files for lxml

  • autopy, which in turn needs the Xtst library and headers on Linux (aka XTest)

Feedback

Open an Issue to report a bug or request a new feature. Other comments and suggestions can be directly emailed to the authors.

Install

  • Through regular easy_install / buildout:

    easy_install spynner

    (On Windows, you may have to install autopy through its installer at https://pypi-hypernode.com/pypi/autopy/)

  • The bleeding edge version is hosted on github:

    git clone https://github.com/kiorky/spynner.git
    cd spynner
    python setup.py install

Running Spynner without X11

  • Spynner needs an X11 server to run. If you are running it on a server without X11, you must install the virtual Xvfb server. Debian users can use the small wrapper (xvfb-run). If you are not using Debian, you can download it here: http://www.mail-archive.com/debian-x@lists.debian.org/msg69632/x-run

    xvfb-run python myscript_using_spynner.py
  • You can also use tightvnc, which is the solution used by the current maintainer [kiorky].
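
When the same script may run both on a desktop and on a headless server, the choice between plain python and xvfb-run can be made at launch time. The sketch below assumes that an unset DISPLAY variable means no X11 server is reachable, which is a simplification. build_command is a hypothetical helper for illustration, not part of Spynner.

```python
import os
import sys


def build_command(script, environ=None, python=sys.executable):
    """Return the argv list to run `script`, wrapped in xvfb-run when no display is set."""
    environ = os.environ if environ is None else environ
    command = [python, script]
    if not environ.get("DISPLAY"):
        # Assumption: no DISPLAY means no X11 server, so use the virtual Xvfb server.
        command = ["xvfb-run"] + command
    return command
```

The returned list can be handed to subprocess.call; nothing is executed by the helper itself.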

Initializing the browser

The basic setup needed to get a browser up and running:

>>> import spynner, os, sys
>>> def print_contents(browser, dest='~/.browser.html'):
...     """Write the browser contents to a file so you can inspect them.
...     In a doctest pdb session, type print_contents(browser), then open
...     file://~/.browser.html in Firefox."""
...     with open(os.path.expanduser(dest), 'w') as fic:
...         fic.write(browser.contents)
>>> import time
>>> from StringIO import StringIO
>>> debug_stream = StringIO()
>>> bp = os.path.dirname(spynner.tests.__file__)

The browser:

>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)

When all is done:

>>> browser.close()
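
The examples below rely on two small helpers that print what was written to the debug stream during a call:

>>> def run_debug(callback, *args, **kwargs):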
...     pos = debug_stream.pos
...     ret = callback(*args, **kwargs)
...     show_debug(pos)
...     return ret


>>> def show_debug(pos=None):
...     if not pos: print debug_stream.getvalue()
...     else:
...         pnow = debug_stream.pos
...         debug_stream.seek(pos)
...         print debug_stream.read()
...         debug_stream.seek(pnow)
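
The helpers above rely on the Python 2 StringIO module and its pos attribute, neither of which exists on Python 3. A minimal equivalent is sketched below, under the assumption that an io.StringIO object is acceptable as debug_stream (this is not checked against Spynner here). read_since is a hypothetical name introduced for the illustration.

```python
import io

debug_stream = io.StringIO()


def read_since(stream, pos):
    """Return the text written to `stream` after offset `pos`.

    The stream is left positioned at its end, so later writes still append.
    """
    end = stream.seek(0, io.SEEK_END)
    stream.seek(pos)
    text = stream.read()
    stream.seek(end)
    return text


def run_debug(callback, *args, **kwargs):
    """Run `callback` and print only the debug output it produced."""
    pos = debug_stream.tell()
    ret = callback(*args, **kwargs)
    print(read_since(debug_stream, pos), end="")
    return ret
```

tell() replaces the pos attribute, and print is used as a function.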

Debugging

Spynner uses WebKit, which is somewhat low level, so never hesitate to activate verbose logs. When you want to see what is going on:

>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)

Or after initialization:

>>> browser.debug_level = spynner.DEBUG

See more examples in the repository: https://github.com/kiorky/spynner/tree/master/examples

Showing spynner window

If you also want to see what the browser is doing, just use:

>>> browser.show()

You can hide the webview with:

>>> browser.hide()

Running Javascript

Simply use:

>>> ret = browser.runjs('console.log("foobar")')

Browsing with spynner

A basic but tricky example: WordReference loads resources that can fail, so we wait for the content to be there.

If the website were reliable, we could simply use:

>>> ret = debug_stream.read()
>>> browser.load(bp+"/html_controls.html")
True

This method raises an exception on timeout; the default 30-second timeout can be customized.

But here our target can randomly fail. Instead, we load the page and wait for something to be present in the DOM before continuing. We wait until ‘aaa’ appears in the HTML, with unlimited tries at 1-second intervals:

>>> def wait_load(br):
...     return 'aaa' in br.html

Hit the wrong URL and you are stuck in an endless loop:

>>> browser.load(bp+"html_controls.html", 1, wait_callback=wait_load)

content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
content loaded, waiting for content to mach the callback
<Control-C>

Hit the wrong URL and you are stuck in an endless loop, unless you limit the number of tries. The call then raises an exception, but it stops:

>>> ret = debug_stream.read()
>>> browser.load(bp+"html_controls.html", 1, wait_callback=wait_load, tries=2)
Traceback (most recent call last):
  ...
SpynnerTimeout: SPYNNER waitload: Timeout reached: 2 retries for 1s delay.

Done playing; now load the real target:

>>> ret = browser.load(bp+"/html_controls.html", 1, wait_callback=wait_load)
>>> [a for a in debug_stream.getvalue().splitlines() if 'SPYNNER waitload' in a][-1]
'SPYNNER waitload: The callback found what it was waiting for in its contents!'
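
The wait-and-retry behaviour used in this section can be pictured with a small standalone loop. This is only a sketch of the idea, not Spynner's implementation, which may differ: wait_for and WaitTimeout are hypothetical names, and the callback here takes no argument to keep the example free of any browser.

```python
import time


class WaitTimeout(Exception):
    """Raised when the callback never succeeds (a stand-in for SpynnerTimeout)."""


def wait_for(callback, tries=None, delay=1.0, sleep=time.sleep):
    """Call `callback` until it returns a true value.

    tries=None retries forever; otherwise give up after `tries` failed calls.
    Returns the number of calls that were needed.
    """
    attempt = 0
    while True:
        if callback():
            return attempt + 1
        attempt += 1
        if tries is not None and attempt >= tries:
            raise WaitTimeout(
                "Timeout reached: %d retries for %ss delay." % (tries, delay))
        sleep(delay)
```

Leaving tries at None reproduces the endless loop shown above when the condition can never become true, which is why setting a limit is the safer default in scripts.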

Interact with the controls

  • See the implementation docstrings or the examples.

  • You have three levels of control:

    • WebKit native methods (wk_fill_*, wk_click_*), which are the recommended ones.

    • The classical methods (fill, click_*), originally jQuery based, which are now wrappers around the wk_* methods.

    • Low-level Qt raw events, which do not work that well at the moment. At least you can move the mouse and send keys, but this has to be coded case by case.

Setup:

>>> browser.close()
>>> del browser

Using radio inputs

>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
>>> ret = browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)

Using jquery

>>> browser.load_jquery(True)

>>> browser.radio('#radiomea')

>>> ret = run_debug(browser.runjs, '$("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});')
Run Javascript code: $("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});
Javascript console (:1): radiomea a true
Javascript console (:1): radiomeb b false
Javascript console (:1): radiomec c false
<BLANKLINE>
>>> browser.radio('#radiomeb')
>>> ret = run_debug(browser.runjs, '$("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});')
Run Javascript code: $("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});
Javascript console (:1): radiomea a false
Javascript console (:1): radiomeb b true
Javascript console (:1): radiomec c false
<BLANKLINE>

Using webkit native methods

Under the hood, we use this.evaluateJavaScript('this.value = xxx'):

>>> browser.wk_radio('#radiomea')
>>> browser.load_jquery(True)
>>> ret = run_debug(browser.runjs, '$("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});')
Run Javascript code: $("input[name=radiome]").each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.val()+" "+je.attr("checked"));});
Javascript console (:1): radiomea a true
Javascript console (:1): radiomeb b false
Javascript console (:1): radiomec c false
<BLANKLINE>

Using check inputs

Using webkit native methods

>>> browser.close()
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
>>> ret = browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> ret = browser.load_jquery(True)

Under the hood, we use this.evaluateJavaScript('this.value = xxx'):

>>> browser.wk_check('#checkmea')
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
<BLANKLINE>
>>> browser.wk_check(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb true
Javascript console (:1): checkmec true
<BLANKLINE>
>>> browser.wk_uncheck(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
<BLANKLINE>
>>> browser.wk_uncheck(['#checkmea'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea false
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
<BLANKLINE>

Using jquery

>>> browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> browser.load_jquery(True)

Under the hood, we use $(sel).attr('checked', 'checked'):

>>> browser.check('#checkmea')
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
<BLANKLINE>
>>> browser.check(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb true
Javascript console (:1): checkmec true
<BLANKLINE>
>>> browser.uncheck(['#checkmeb', '#checkmec'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea true
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
<BLANKLINE>
>>> browser.uncheck(['#checkmea'])
>>> ret = run_debug(browser.runjs, '$($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});')
Run Javascript code: $($("input[name=checkme]")).each(function(i, e){je=$(e);console.log(je.attr("id")+" "+je.attr("checked"));});
Javascript console (:1): checkmea false
Javascript console (:1): checkmeb false
Javascript console (:1): checkmec false
<BLANKLINE>

Using select inputs

Using webkit native methods

>>> ret = browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> ret = browser.load_jquery(True)

Under the hood, we use this.evaluateJavaScript('this.value = xxx'):

>>> browser.wk_select('#sel', 'aa')
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'aa')
>>> browser.wk_select('#sel', 'bb')
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'bb')
>>> browser.wk_select('#sel', 'dd')
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'dd')

If it is not a multiple select, only the last value is kept:

>>> browser.wk_select('#sel', ['aa', 'bb', 'dd'])
>>> browser.runjs('$("#sel").val();').toString()
PyQt4.QtCore.QString(u'dd')

If it is a multiple select, all the values are selected:

>>> browser.wk_select('#msel', ['maa', 'mbb', 'mdd'])
>>> ret = run_debug(browser.runjs, '$($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): maaa true
Javascript console (:1): mbbb true
Javascript console (:1): mccc false
Javascript console (:1): mddd true
<BLANKLINE>

Using jquery

>>> browser.load(bp+'/html_controls.html', 1, wait_callback=wait_load)
>>> browser.load_jquery(True)

Under the hood, we use $(sel).attr("selected", "selected"):

>>> browser.select('#sel option[name="bbb"]')
>>> pos = debug_stream.pos
>>> ret = run_debug(browser.runjs, '$($("#sel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#sel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): aaa false
Javascript console (:1): bbb true
Javascript console (:1): ccc false
Javascript console (:1): ddd false
<BLANKLINE>

On a multiple select, you can also keep the values that are already selected by passing remove=False; by default (remove=True) they are deselected:

>>> browser.select('#asel option[name="bbb"]', remove=False)
>>> ret = run_debug(browser.runjs, '$($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): aaa false
Javascript console (:1): bbb true
Javascript console (:1): ccc true
Javascript console (:1): ddd false
<BLANKLINE>
>>> browser.select('#asel option[name="bbb"]', remove=True)
>>> ret = run_debug(browser.runjs, '$($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#asel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): aaa false
Javascript console (:1): bbb true
Javascript console (:1): ccc false
Javascript console (:1): ddd false
<BLANKLINE>

If it is a multiple select, all the given options are selected:

>>> browser.select(['#msel option[name="mbbb"]', '#msel option[name="mddd"]'])
>>> ret = run_debug(browser.runjs, '$($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});')
Run Javascript code: $($("#msel option")).each(function(i, e){je=$(e);console.log(je.attr("name")+" "+je.attr("selected"));});
Javascript console (:1): maaa false
Javascript console (:1): mbbb true
Javascript console (:1): mccc false
Javascript console (:1): mddd true
<BLANKLINE>
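
The selection rules illustrated in this section (last value wins on a single select, every value is taken on a multiple select, and remove controls whether earlier selections survive) can be summarised as a small pure-Python model. This is only a reading aid built from the outputs shown above, not Spynner code; apply_selection and its arguments are hypothetical.

```python
def apply_selection(current, chosen, multiple, remove=True):
    """Return the set of selected option names after a select call.

    current:  names selected before the call
    chosen:   names passed to the call, in order
    multiple: whether the <select> accepts several values
    remove:   drop previously selected names (multiple selects only)
    """
    chosen = list(chosen)
    if not multiple:
        # A single select holds one value only: the last one wins.
        return set(chosen[-1:])
    kept = set() if remove else set(current)
    return kept | set(chosen)


# Mirrors the '#asel' example: 'ccc' was selected before the call.
print(sorted(apply_selection({"ccc"}, ["bbb"], multiple=True, remove=False)))
print(sorted(apply_selection({"ccc"}, ["bbb"], multiple=True, remove=True)))
```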

Using text inputs

Using webkit native methods

Under the hood, we use this.evaluateJavaScript('this.value = xxx'):

>>> browser.wk_fill('input[name=w]', 'bar')

Using jquery

Under the hood, we use jQuery(selector).val(xxx):

>>> browser.fill('input[name="w"]', 'foo')
>>> ret = run_debug(browser.fill, 'input[name="w"]', 'foo')
Run Javascript code: $('input[name="w"]').val('foo')
<BLANKLINE>

Jquery Notes

Spynner uses jQuery to make the JavaScript interface easier. Two modules can be injected into a loaded page:

  • jQuery core: amongst other things, it adds the powerful jQuery selectors, which are used internally by some Spynner methods. Of course, you can also use jQuery when you inject your own code into a page.

  • [OBSOLETE, USE AT YOUR OWN RISK, NOT MAINTAINED, NO BUGFIXES] Simulate jQuery plugin: makes it possible to simulate mouse and keyboard events (for now Spynner uses it only in the _click_ action). Look at the library code to see which kinds of events you can fire.

As jQuery is nowadays already included on most major websites, it must not be injected when the targeted website has already loaded it.

Loading jQuery manually

>>> time.sleep(3)
>>> browser.close()
>>> browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
>>> browser.show()
>>> ret = run_debug(browser.runjs,"console.log(typeof(jQuery));")
Run Javascript code: console.log(typeof(jQuery));
Javascript console (:1): undefined
<BLANKLINE>

Oops, jQuery has not been included!

Load it:

>>> ret = browser.load_jquery(force=True)
>>> ret = run_debug(browser.runjs, "console.log(typeof(jQuery));")
Run Javascript code: console.log(typeof(jQuery));
Javascript console (:1): function
<BLANKLINE>

Cook your soup: parsing the HTML

You can parse the HTML of a web page with your favorite parsing library, e.g. BeautifulSoup or lxml. Since we already use jQuery for JavaScript, it feels natural to work with pyquery, its Python counterpart:

>>> import pyquery
>>> ret = browser.load(bp+'/html_controls.html')
>>> d = pyquery.PyQuery(browser.html)
>>> aaa = d.make_links_absolute("http://foo")[0]
>>> [dict(a.items())['href'] for a in  d.root.xpath('//a')]
['http://foo/foo', 'http://foo/a/foo', 'http://foo/../b/foo', 'http://foo/c/foo', 'http://foo/d/foo']
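
If pyquery is not installed, a similar link extraction can be sketched with the Python 3 standard library alone, working on any HTML string such as the one returned by browser.html. LinkCollector and absolute_links are hypothetical names for this illustration. Note that urljoin normalises relative segments such as "..", so its results may differ from the pyquery output shown above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(urljoin(self.base_url, href))


def absolute_links(html, base_url):
    """Return the absolute URLs of all links found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

A call such as absolute_links(browser.html, "http://foo") would then play the role of make_links_absolute followed by the xpath query.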

HTTP Headers

You can give a list of HTTP headers to send, either with every request (set at construction time) or via the load methods.

Headers are in the form:

  • [('User-Agent', 'foobar')]

SSL support

You have two keyword arguments to specify:

  • a list (see QtSsl) of supported ciphers to use

  • the protocol to use (sslv2, tlsv1, sslv3)

Mouse

You can move the mouse onto an element matched by a CSS selector; offsetx and offsety are optional and default to 0:

br.move_mouse('.myclass', offsetx=0, offsety=0)

Proxy support

Spynner supports all the proxies supported by Qt (http(s), socks5 & ftp).

See examples/proxy.py in the examples directory

Basically, use:

br.set_proxy('foo:3128')
br.set_proxy('http://foo:3128')
br.set_proxy('http://user:suserpassword@foo:3128')
br.set_proxy('https://user:suserpassword@foo:3128')
br.set_proxy('socks5://user:suserpassword@foo:3128')
br.set_proxy('httpcaching://user:suserpassword@foo:3128')
br.set_proxy('ftpcaching://user:suserpassword@foo:3128')

You can also use a proxy in the download method. Note that, by default, it uses the proxy set by a previous br.set_proxy call:

br.download('http://superfile', proxy_url='foo:3128')
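
To see which parts a proxy string of the forms listed above carries, it can be split with the Python 3 standard library. This is not how Spynner parses the value; describe_proxy is a hypothetical helper, and treating a bare host:port as an http proxy is an assumption made for the sketch.

```python
from urllib.parse import urlsplit


def describe_proxy(proxy_url, default_scheme="http"):
    """Split a proxy URL into scheme, credentials, host and port."""
    if "://" not in proxy_url:
        # Assumption: a bare 'host:port' is meant as an http proxy.
        proxy_url = "%s://%s" % (default_scheme, proxy_url)
    parts = urlsplit(proxy_url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }


print(describe_proxy("socks5://user:suserpassword@foo:3128"))
```

Checking the parsed parts before calling br.set_proxy is a cheap way to catch a malformed proxy string early.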

CHANGELOG

2.24 (2019-04-20)

  • support

2.23 (2019-03-26)

  • support

2.18 (2014-07-19)

  • changelog fix

2.17 (2014-07-19)

  • py3 support

2.16 (2014-04-25)

  • fix a bug in download: the reply could be not fully read on exit

2.15 (2013-07-16)

  • fix #46

2.14 (2013-06-05)

  • cookie jar fix (#41)

2.13 (2013-05-17)

  • Better proxy support

  • Travis setup

2.12 (2013-05-03)

  • Cookie jar fix

2.11 (2013-04-23)

  • fix release again

2.10 (2013-04-22)

  • fix release

2.9 (2013-04-22)

  • run native clicks with autopy

2.8 (2013-04-19)

  • add a helper to move the mouse more easily

2.7 (2013-04-17)

  • Better ssl support

  • better http headers support

  • pyside support

  • better cookie support

2.6 (2013-03-07)

  • fix #17: download timeout

2.5 (2013-03-06)

  • fix #25: new signal api for sslErrors

2.4 (2012-09-28)

  • Example google fixed

2.3 (2012-09-28)

  • documentation

2.2 (2012-09-20)

  • Fix a bug where jQuery compatibility mode could fail to be activated; thanks to yusumishi (yusumishi@gmail.com) for the report.

2.1 (2012-08-30)

  • proper release

2.0 (2012-08-05)

  • Make new defaults for sane initialization & api cleanup, now:

    • we remapped the simulation functions to the wk_* ones

    • we added extensive documentation in src/spynner/tests/spynner.rst

    • we do not embed jQuery by default

    • we do not embed jQuery’s simulate plugin automatically, as it is totally deprecated

1.11 (2012-08-04)

  • proper release

1.10 (2011-06-07)

  • add wk_check/wk_uncheck methods

1.9 (2011-05-29)

  • Rework javascript load [kiorky]

  • Some try in native events [kiorky]

  • Fix directory issue [kiorky]

  • add Samples [kiorky]

  • Fix download cookiesjar free problem [kiorky <kiorky@cryptelium.net>]

  • Allow download to be tracked for further reuse [kiorky <kiorky@cryptelium.net>]

  • Generate filenames by looking for their filename in response objects. [kiorky <kiorky@cryptelium.net>]

  • Add api methods to:

    • send raw keyboard keys

    • send qt raw mouse clicks

    • use qtwebkit native JS click element & fill values

    • some helpers to wait for content

    [kiorky]

  • Add download files tracker [kiorky]

0.0.3 (2009-08-01)

  • Click does not wait for page load

  • Use QtNetwork infrastructure to download files

  • Expose webkit objects in Browser class

  • Change jQuery to _jQuery

  • HTTP authentication

  • Callbacks for Javascript confirm and prompts

  • Properties: url, html, soup

  • Better docstrings (using epydoc)

  • Implement image snapshots

  • Implement URL filters

  • Implement cookies setting [tokland <pyarnau@gmail.com>]

0.0.2 (2009-07-27)

  • Use browser.html instead of browser.get_html

  • Fix setup.py to make it compatible with Win32

  • Add a URL filter mechanism (with a callback)

  • Use class-methods instead of burdening Browser.__init__

  • Instance variable to ignore SSL certificate errors

  • Start using epydoc format for API documentation

  • Add create_webview/destroy_webview for GUI debugging [tokland <pyarnau@gmail.com>]

0.0.1 (2009-07-25)
