ftw.crawler

Crawl sites, extract text and metadata, index it in Solr


Installation

The easiest way to install ftw.crawler is to create a buildout that contains the configuration, pulls in the egg using zc.recipe.egg, and creates a script in the bin/ directory that launches the crawler with the respective configuration as an argument.

First, create a configuration file for the crawler. You can base your configuration on ftw/crawler/tests/assets/basic_config.py by copying it to your buildout and adapting it as needed.

Make sure to configure at least the tika and solr URLs to point to the correct locations of the respective services, and to adapt the sites list to your needs.

Create a buildout config that installs ftw.crawler using zc.recipe.egg:

crawler.cfg

```
[buildout]
parts +=
    crawler
    crawl-foo-org

[crawler]
recipe = zc.recipe.egg
eggs = ftw.crawler
```

Further define a buildout section that creates a bin/crawl-foo-org script, which will call bin/crawl foo_org_config.py using absolute paths (for easier use from cron jobs):
```
[crawl-foo-org]
recipe = collective.recipe.scriptgen
cmd = ${buildout:bin-directory}/crawl
arguments =
    ${buildout:directory}/foo_org_config.py
    --tika http://localhost:9998/
    --solr http://localhost:8983/solr
```
(The --tika and --solr command line arguments are optional; both values can also be set in the configuration file. If given, the command line arguments take precedence over the corresponding parameters in the config file.)
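
The precedence rule can be illustrated with a small standalone sketch. It is not ftw.crawler's own argument handling: the `CONFIG_DEFAULTS` dict and the `resolve_settings` helper are hypothetical stand-ins for the values loaded from the configuration file, and only the `--tika` and `--solr` option names come from the text above.

```python
import argparse

# Hypothetical stand-in for the values loaded from the configuration file.
CONFIG_DEFAULTS = {
    "tika": "http://localhost:9998/",
    "solr": "http://localhost:8983/solr",
}


def resolve_settings(argv, config=CONFIG_DEFAULTS):
    """Return the effective tika/solr URLs for a given argument list."""
    parser = argparse.ArgumentParser(prog="crawl")
    parser.add_argument("config_path")
    parser.add_argument("--tika", default=None)
    parser.add_argument("--solr", default=None)
    args = parser.parse_args(argv)

    settings = dict(config)
    for key in ("tika", "solr"):
        value = getattr(args, key)
        if value is not None:
            # A command line argument overrides the config file value.
            settings[key] = value
    return settings
```

Calling `resolve_settings(["foo_org_config.py", "--solr", "http://solr.example.org/solr"])` keeps the Tika URL from the config stand-in and replaces only the Solr URL.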

Add a buildout config that downloads and configures a Tika JAXRS server:

tika-server.cfg

```
[buildout]
parts +=
    supervisor
    tika-server-download
    tika-server

[supervisor]
recipe = collective.recipe.supervisor
plugins =
    superlance
port = 8091
user = supervisor
password = admin
programs =
    10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user

[tika-server-download]
recipe = hexagonit.recipe.download
url = https://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
md5sum = 0f70548f233ead7c299bf7bc73bfec26
download-only = true
filename = tika-server.jar

[tika-server]
port = 9998
recipe = collective.recipe.scriptgen
cmd = java
arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}
```

Modify your_os_user and the supervisor and Tika ports as needed.

Finally, add a bootstrap.py and create the buildout.cfg that pulls all of the above together:

buildout.cfg
```
[buildout]
extensions = mr.developer

extends =
    tika-server.cfg
    crawler.cfg
```
Bootstrap and run buildout:
```
python bootstrap.py
bin/buildout
```

Running the crawler

If you created the bin/crawl-foo-org script with the buildout described above, that’s all you need to run the crawler:

Make sure Tika and Solr are running
Run bin/crawl-foo-org, using either a relative or an absolute path. The working directory doesn’t matter, so the script can easily be called from a cron job.
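
To check the first step before a cron run, a plain TCP connection test is often enough. The following is a minimal sketch and not part of ftw.crawler; the `is_listening` helper is hypothetical, and the URLs you pass to it should be the ones from your own configuration.

```python
import socket
from urllib.parse import urlsplit


def is_listening(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    # Fall back to the scheme's default port if the URL names none.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and DNS failures all end up here.
        return False
```

For the example setup above you would call `is_listening("http://localhost:9998/")` for Tika and `is_listening("http://localhost:8983/solr")` for Solr. Note that this only confirms that something accepts connections on the port, not that the service behind it is healthy.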

Running bin/crawl directly

The bin/crawl-foo-org script is just a tiny wrapper that calls the bin/crawl script, generated by ftw.crawler’s setuptools console_scripts entry point, with the absolute path to the configuration file as the only argument. Any other arguments passed to bin/crawl-foo-org are forwarded to bin/crawl.

Therefore running bin/crawl-foo-org [args] is equivalent to bin/crawl foo_org_config.py [args].
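
The equivalence can be sketched in a few lines of Python. This is not the script that collective.recipe.scriptgen generates; the two paths and the `wrapper_command` helper are hypothetical and only show how the fixed part and the forwarded arguments combine.

```python
# Hypothetical absolute paths, as buildout would substitute them.
CRAWL = "/opt/buildout/bin/crawl"
CONFIG = "/opt/buildout/foo_org_config.py"


def wrapper_command(extra_args):
    """Build the command line that the wrapper hands to bin/crawl."""
    # Fixed part first, then whatever the caller passed to the wrapper.
    return [CRAWL, CONFIG] + list(extra_args)
```

With no extra arguments the result is just the crawl script and the config path. Anything else, for example a single URL to index, is appended unchanged.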

Provide known sitemap URLs in site configs

If you know the sitemap URL of a site, you can configure one or more sitemap URLs statically:

```
Site('http://example.org/foo/',
     sitemap_urls=['http://example.org/foo/the_sitemap.xml'])
```
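
For reference, a sitemap in the sitemaps.org format lists page URLs in `<loc>` elements, optionally with a `<lastmod>` date. The sketch below reads such a file with the standard library. It is an illustration only and not the parser ftw.crawler uses; `EXAMPLE_SITEMAP` and `sitemap_entries` are made-up names.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

EXAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/foo/page-1</loc>
    <lastmod>2017-11-01</lastmod>
  </url>
  <url>
    <loc>http://example.org/foo/page-2</loc>
  </url>
</urlset>
"""


def sitemap_entries(xml_text):
    """Return (url, lastmod) pairs from a sitemap; lastmod may be None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.iter(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        lastmod = url.findtext(SITEMAP_NS + "lastmod")
        if loc:
            entries.append((loc.strip(), lastmod))
    return entries
```

Running `sitemap_entries(EXAMPLE_SITEMAP)` yields one pair per `<url>` element, which is a quick way to confirm that a sitemap you plan to configure actually lists the pages you expect.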

Configure site ID for purging

For purging to work smoothly, it is recommended to configure a crawler site ID. Make sure that each site ID is unique per Solr core! Candidate documents for purging are identified by this crawler site ID.

```
Site('http://example.org/',
     crawler_site_id='example.org-news')
```

Be aware that your Solr core must provide a string field named crawler_site_id.
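
Why the ID has to be unique per core becomes clearer with a toy model of candidate selection. The sketch assumes that a document is a purge candidate when it carries the site's ID but was not seen in the latest crawl. That rule, the `url` key, and the `purge_candidates` helper are assumptions for illustration, not ftw.crawler's actual purging code.

```python
# Toy stand-in for documents stored in one Solr core.
INDEX = [
    {"url": "http://example.org/news/a", "crawler_site_id": "example.org-news"},
    {"url": "http://example.org/news/b", "crawler_site_id": "example.org-news"},
    {"url": "http://example.org/blog/c", "crawler_site_id": "example.org-blog"},
]


def purge_candidates(indexed_docs, crawled_urls, crawler_site_id):
    """Return URLs indexed for this site ID that the latest crawl did not see."""
    crawled = set(crawled_urls)
    return sorted(
        doc["url"]
        for doc in indexed_docs
        # Only documents tagged with this site's ID are ever considered.
        if doc.get("crawler_site_id") == crawler_site_id
        and doc["url"] not in crawled
    )
```

If two sites shared one ID in this model, a crawl of the first site would mark every document of the second site as a candidate, because none of them appear among the crawled URLs.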

Indexing only a particular URL

If you only want to index a particular URL, pass that URL as the first argument to bin/crawl-foo-org. The crawler will then only fetch and index that specific URL.

Slack notifications

ftw.crawler supports Slack notifications, which can be used to monitor the crawler for errors that occur while crawling. To enable Slack notifications for your environment, do the following:

Install ftw.crawler with the slack extra (ftw.crawler[slack]).
Set the SLACK_TOKEN and SLACK_CHANNEL params in your crawler config, or use the --slacktoken and --slackchannel arguments on the command line when calling the bin/crawl script.

To generate a valid Slack token for your integration, create a new bot in your Slack team. Slack automatically generates a token for the new bot, which you can then use for your integration. You can also generate a test token to try out your integration, but don’t forget to create a bot once your application goes to production!
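
To verify a token and channel independently of the crawler, you can post a test message through Slack's `chat.postMessage` Web API method. The sketch below only builds the request with the standard library. It is not how ftw.crawler sends its notifications, and `build_slack_request` is a hypothetical helper.

```python
import json
import urllib.request

SLACK_API_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token, channel, text):
    """Prepare (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_API_URL,
        data=payload,
        headers={
            # The bot token is passed as a bearer token.
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the message. Slack answers with a JSON body whose `ok` field tells you whether the token and channel were accepted.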

Development

To start hacking on ftw.crawler, use the development.cfg buildout:

```
ln -s development.cfg buildout.cfg
python bootstrap.py
bin/buildout
```

This will build a Tika JAXRS server and a Solr instance for you. The Solr configuration is set up to be compatible with the testing / example configuration at ftw/crawler/tests/assets/basic_config.py.

To run the crawler against the example configuration:

```
bin/tika-server
bin/solr-instance fg
bin/crawl ftw/crawler/tests/assets/basic_config.py
```

Copyright

This package is copyrighted by 4teamwork.

ftw.crawler is licensed under the GNU General Public License, version 2.

Changelog

1.4.0 (2017-11-08)

Add crawler_site_id option for improving purging. [jone]

1.3.0 (2017-11-03)

Fix purging problem. Warning: updating “ftw.crawler” to this version breaks your existing crawlers when you set the site url to a sitemap url. Please use the “sitemap_urls” attribute instead. You also need to purge the Solr index manually and reindex. [jone]

1.2.1 (2017-10-30)

Encode URL in UTF-8 before generating MD5-Hash. [raphael-s]

1.2.0 (2017-06-22)

Support Slack notifications. [raphael-s]

1.1.0 (2016-10-04)

Support configuration of absolute sitemap urls. [jone]
Slow down on too many requests. [jone]

1.0 (2015-11-09)

Initial implementation. [lgraf]

