serpextract

serpextract provides easy extraction of keywords from search engine results pages (SERPs).

This module is possible in large part thanks to the very hard work of the Piwik team. Specifically, we make extensive use of their list of search engines.

Installation

Latest release on PyPI:

$ pip install serpextract

Or the latest development version:

$ pip install -e git://github.com/Parsely/serpextract.git#egg=serpextract

Usage

Command Line

Command-line usage returns the engine name and keyword components, each enclosed in quotes and separated by a comma:

$ serpextract "http://www.google.ca/url?sa=t&rct=j&q=ars%20technica"
"Google","ars technica"

You can also print out a list of all the SearchEngineParsers currently available in your local cache via:

$ serpextract -l

Python

from serpextract import get_parser, extract, is_serp, get_all_query_params

non_serp_url = 'http://arstechnica.com/'
serp_url = ('http://www.google.ca/url?sa=t&rct=j&q=ars%20technica&source=web&cd=1&ved=0CCsQFjAA'
            '&url=http%3A%2F%2Farstechnica.com%2F&ei=pf7RUYvhO4LdyAHf9oGAAw&usg=AFQjCNHA7qjcMXh'
            'j-UX9EqSy26wZNlL9LQ&bvm=bv.48572450,d.aWc')

get_all_query_params()
# ['key', 'text', 'search_for', 'searchTerm', 'qrs', 'keyword', ...]

is_serp(serp_url)
# True
is_serp(non_serp_url)
# False

get_parser(serp_url)
# SearchEngineParser(engine_name='Google', keyword_extractor=['q'], link_macro='search?q={k}', charsets=['utf-8'])
get_parser(non_serp_url)
# None

extract(serp_url)
# ExtractResult(engine_name='Google', keyword=u'ars technica', parser=SearchEngineParser(...))
extract(non_serp_url)
# None

Tests

There are some basic tests for popular search engines, but more are required:

$ pip install -r requirements.txt
$ nosetests

Caching

Internally, this module caches an OrderedDict representation of Piwik’s list of search engines, stored in serpextract/search_engines.pickle. The list isn’t expected to change often, so this module ships with a pre-built cache. You can manually update the local cache via:

$ serpextract -u

This action currently requires PHP (we know, we know). We grab Piwik’s PHP array of all search engines, turn it into an OrderedDict, and store it in pickle form. Ideally, we would have this search engine list in a language-independent form like JSON.
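To illustrate why JSON would be preferable, here is a minimal sketch of round-tripping an engine definition through JSON. The fragment below is hypothetical (the real pickle maps many more domains to parser definitions derived from Piwik's PHP array):

```python
import json
from collections import OrderedDict

# Hypothetical fragment of the engine list, mirroring SearchEngineParser's fields.
engines = OrderedDict([
    ('google.com', {'engine_name': 'Google',
                    'keyword_extractor': ['q'],
                    'link_macro': 'search?q={k}',
                    'charsets': ['utf-8']}),
])

# JSON round-trips cleanly and requires neither PHP nor pickle on the consumer side.
serialized = json.dumps(engines)
restored = json.loads(serialized)
print(restored['google.com']['engine_name'])
# Google
```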
