Python interface to a free corpus subset from ruscorpora.ru

This package provides a Python interface to a free corpus subset available at http://ruscorpora.ru.

Installation

pip install ruscorpora-tools

Usage

Obtaining corpora

Download and unpack the archive with XML files from http://www.ruscorpora.ru/corpora-usage.html
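
Once the archive is unpacked, it can help to collect the XML files before parsing them. The following is a minimal sketch, not part of ruscorpora-tools: the helper name find_xml_files is hypothetical, and it assumes the unpacked files end in .xml and may sit in nested subdirectories.

```python
from pathlib import Path


def find_xml_files(root):
    """Return every .xml file under root (searched recursively).

    Hypothetical helper, not part of ruscorpora-tools. The result is
    sorted so that repeated runs visit the files in the same order.
    """
    return sorted(Path(root).rglob("*.xml"))
```

Each returned path can be converted with str() and passed to the parser described in the next section.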

Using corpora

The ruscorpora.parse_xml function parses a single XML file and returns an iterator over sentences. Each sentence is a list of ruscorpora.Token instances, and each token is annotated with a list of ruscorpora.Annotation instances.

ruscorpora.simplify simplifies the result of ruscorpora.parse_xml by removing ambiguous annotations, joining split tokens, and removing accent information.

>>> import ruscorpora as rc
>>> for sent in rc.simplify(rc.parse_xml('fiction.xml')):
...     print(sent)
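
Because the parser returns an iterator, a large file does not have to be read to the end just to inspect a few sentences. The sketch below shows one way to preview the first items. The head helper is hypothetical (it is not part of ruscorpora-tools) and works with any iterable of sentences.

```python
from itertools import islice


def head(sentences, n=5):
    """Return the first n sentences as a list.

    Hypothetical helper, not part of ruscorpora-tools. islice stops
    after n items, so the rest of the iterator is left unconsumed.
    """
    return list(islice(sentences, n))
```

Assuming the functions behave as described above, head(rc.simplify(rc.parse_xml('fiction.xml')), 3) would return the first three simplified sentences.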

Development

Development happens on GitHub and Bitbucket.

The issue tracker is on GitHub: https://github.com/kmike/ruscorpora-tools/issues

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

Running tests

Make sure tox is installed, then run

$ tox

from the source checkout. Tests should pass under Python 2.6 through 3.3 and PyPy > 1.8.

