Crawling and feeding HTML content into a transmogrifier pipeline
Introduction
- transmogrify.webcrawler
A source blueprint for crawling content from a site or from local HTML files.
- transmogrify.webcrawler.typerecognitor
A blueprint for assigning a content type based on the MIME type reported by the webcrawler.
- transmogrify.webcrawler.cache
A blueprint that saves crawled content into a directory structure.
transmogrify.webcrawler
A transmogrifier source blueprint that crawls a site starting from a URL and reads in pages until every reachable page has been crawled.
Options
- site_url
The URL at which to start crawling. It is treated as the base, and any links outside this base are ignored.
- ignore
Regular expressions for URLs not to follow.
- alias_bases
Substitutions for URL bases. This is useful where the URL used to access the site is not the same as the absolute URLs of the links in its pages.
- patterns
Regular expressions to substitute before the HTML is parsed. Newline separated.
- subs
Replacement text for the patterns.
- checkext
checkext
- verbose
verbose
- maxpage
maxpage
- nonames
nonames
- cache
cache
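As a rough illustration of how site_url, ignore and alias_bases might interact, the sketch below decides whether a discovered link should be followed. The functions normalise and should_follow and the example.com URLs are hypothetical and are not part of transmogrify.webcrawler; the real crawler builds on webchecker and may behave differently, for example in how it matches the ignore expressions.

```python
import re


def normalise(url, site_url, alias_bases=()):
    """Rewrite a link that starts with an alias base so it starts with site_url."""
    for alias in alias_bases:
        if url.startswith(alias):
            return site_url + url[len(alias):]
    return url


def should_follow(url, site_url, ignore=(), alias_bases=()):
    """Return True if url lies under site_url and matches no ignore pattern."""
    url = normalise(url, site_url, alias_bases)
    if not url.startswith(site_url):
        return False  # links outside the base are ignored
    # Assumption: an ignore expression may match anywhere in the URL.
    return not any(re.search(pattern, url) for pattern in ignore)


site = "http://example.com/docs/"
print(should_follow("http://example.com/docs/page.html", site))
print(should_follow("http://other.example.com/", site))
print(should_follow("http://example.com/docs/print/page.html", site,
                    ignore=[r"/print/"]))
print(should_follow("http://mirror.example.com/page.html", site,
                    alias_bases=["http://mirror.example.com/"]))
```

Only the first and last links would be followed: the second lies outside the base and the third matches an ignore expression.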
Keys inserted
The following options set the names of the keys on the items added to the pipeline.
- pathkey
default: _path. The path of the url not including the base
- siteurlkey
default: _site_url. The base of the url
- originkey
default: _origin. The original path, in case retrieving the URL caused a redirection
- contentkey
default: _content. The main content of the url
- contentinfokey
default: _content_info. Headers returned by urlopen
- sortorderkey
default: _sortorder. A count of when a link to this item was first encountered while crawling
- backlinkskey
default: _backlinks. A list of (url, path) tuples for the pages that link to this item
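To make the sort order and backlinks keys more concrete, here is a minimal sketch of a breadth-first crawl over an in-memory link graph. The crawl function and the graph are hypothetical stand-ins for fetching and parsing real pages, the traversal order is an assumption, and the backlinks are simplified to plain paths rather than (url, path) tuples.

```python
from collections import deque


def crawl(links, start=""):
    """Breadth-first walk over a dict mapping each path to the paths it links to."""
    items = {start: {"_path": start, "_sortorder": 0, "_backlinks": []}}
    queue = deque([start])
    counter = 0
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in items:
                # First time a link to this target is seen: assign its order.
                counter += 1
                items[target] = {"_path": target, "_sortorder": counter,
                                 "_backlinks": []}
                queue.append(target)
            # Record every page that links to the target.
            items[target]["_backlinks"].append(page)
    return sorted(items.values(), key=lambda item: item["_sortorder"])


graph = {
    "": ["file1.htm", "subfolder"],
    "file1.htm": ["subfolder"],
    "subfolder": [""],
}
for item in crawl(graph):
    print(item)
```

In this toy graph, subfolder is first seen from the start page, so it gets sort order 2, and it ends up with two backlinks because file1.htm links to it as well.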
Tests
>>> testtransmogrifier(dontprint=['_content'], source="""
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... """)
{'_backlinks': [],
 '_content_info': {'content-type': 'text/html'},
 '_origin': 'file://.../test_staticsite',
 '_path': '',
 '_site_url': 'file://.../test_staticsite/',
 '_sortorder': 0}
...
>>> testtransmogrifier(source=webcrawler, strip=['_content'])
{... '_path': '', ...}
{... '_path': 'file2.htm', ...}
{... '_path': 'subfolder', ...}
{... '_path': 'egenius-plone.gif', ...}
{... '_path': 'plone_schema.png', ...}
...
>>> source = """
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
... alias_bases = http://somerandomsite file:///
... patterns =
...     (?s)<SCRIPT.*Abbreviation"\)
...     (?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)
...     (?s)State=.*<body[^>]*>
... subs =
...     </head><body>
...     <a href="\g<u>">\g<a></a>
...     <br>
... """
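The patterns and subs options in the configuration above can be read as pairs of regular expressions and replacements. The following is a minimal sketch of that idea, assuming the two lists are matched up by position and applied in order with re.sub; the helper apply_substitutions and the sample HTML are hypothetical, and the blueprint itself may apply them differently.

```python
import re


def apply_substitutions(html, patterns, subs):
    """Apply each pattern with the replacement at the same position."""
    for pattern, replacement in zip(patterns, subs):
        html = re.sub(pattern, replacement, html)
    return html


# One of the pattern/replacement pairs from the test configuration.
patterns = [r"(?s)MakeLink\('(?P<u>[^']*)','(?P<a>[^']*)'\)"]
subs = [r'<a href="\g<u>">\g<a></a>']

html = "MakeLink('page.htm','Next page')"
print(apply_substitutions(html, patterns, subs))
# prints: <a href="page.htm">Next page</a>
```

Rewriting script-generated links into plain anchors before parsing is what lets the crawler discover pages that would otherwise only be reachable through JavaScript.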
External scripts used
http://svn.python.org/projects/python/trunk/Tools/webchecker/webchecker.py
http://svn.python.org/projects/python/trunk/Tools/webchecker/websucker.py
TypeRecognitor
TypeRecognitor is a transmogrifier blueprint that determines the Plone type of an item from its MIME type. It reads the MIME type from the headers in _content_info, which is set by transmogrify.webcrawler.
>>> from os.path import dirname
>>> from os.path import abspath
>>> config = """
...
... [transmogrifier]
... pipeline =
...     webcrawler
...     typerecognitor
...     clean
...     printer
...
... [webcrawler]
... blueprint = transmogrify.webcrawler
... site_url = file://%s/test_staticsite
...
... [typerecognitor]
... blueprint = transmogrify.webcrawler.typerecognitor
...
... [clean]
... blueprint = collective.transmogrifier.sections.manipulator
... delete =
...     file
...     text
...     image
...
... [printer]
... blueprint = collective.transmogrifier.sections.tests.pprinter
...
... """ % abspath(dirname(__file__)).replace('\\','/')

>>> from collective.transmogrifier.tests import registerConfig
>>> registerConfig(u'transmogrify.webcrawler.typerecognitor.test', config)

>>> from collective.transmogrifier.transmogrifier import Transmogrifier
>>> transmogrifier = Transmogrifier(plone)
>>> transmogrifier(u'transmogrify.webcrawler.typerecognitor.test')
{...
 '_mimetype': 'image/jpeg',
 ...
 '_path': 'cia-plone-view-source.jpg',
 ...
 '_type': 'Image',
 ...}
...
{'_mimetype': 'image/gif',
 '_path': '/egenius-plone.gif',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Image'}

{'_mimetype': 'application/msword',
 '_path': '/file.doc',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': 'doc_to_html',
 '_type': 'Document'}

{'_mimetype': 'text/html',
 '_path': '/file1.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}

{'_mimetype': 'text/html',
 '_path': '/file2.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}

{'_mimetype': 'text/html',
 '_path': '/file3.html',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}

{'_mimetype': 'text/html',
 '_path': '/file4.HTML',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}

{'_mimetype': 'image/png',
 '_path': '/plone_schema.png',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Image'}

{'_mimetype': 'text/html',
 '_path': '/subfolder',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}

{'_mimetype': 'text/html',
 '_path': '/subfolder/subfile1.htm',
 '_site_url': 'file:///home/rok/Projects/pretaweb_dev/src/transmogrify.webcrawler/pretaweb/blueprints/test_staticsite',
 '_transform': None,
 '_type': 'Document'}
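The mapping visible in the output above (images become Image, HTML and Word documents become Document, with a doc_to_html transform for Word) could be sketched roughly as follows. The function recognise is hypothetical and is not the blueprint's actual code; in particular, the File fallback for unrecognised MIME types is an assumption that the output above does not confirm.

```python
def recognise(item):
    """Add _mimetype, _type and _transform to an item based on its headers."""
    content_type = item.get("_content_info", {}).get("content-type", "")
    # Drop any parameters such as "; charset=utf-8" and normalise case.
    mimetype = content_type.split(";")[0].strip().lower()
    if mimetype.startswith("image/"):
        plone_type, transform = "Image", None
    elif mimetype == "text/html":
        plone_type, transform = "Document", None
    elif mimetype == "application/msword":
        plone_type, transform = "Document", "doc_to_html"
    else:
        # Assumption: anything else is treated as a generic file.
        plone_type, transform = "File", None
    item.update(_mimetype=mimetype, _type=plone_type, _transform=transform)
    return item


item = {"_path": "/file.doc",
        "_content_info": {"content-type": "application/msword"}}
print(recognise(item))
```

Keeping the transform name on the item lets a later pipeline section decide how to convert the content, rather than converting it during recognition.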
Changelog
1.0 - Unreleased
Initial release
transmogrify.webcrawler 0.1 - October 25, 2008
renamed package from pretaweb.blueprints to transmogrify.webcrawler. [djay]
enhanced import view (djay)
0.2
16-7-09 djay Added caching of crawled sites
10-7-09 djay Added UI using z3cform