Python interface to a free corpus subset from ruscorpora.ru

This package provides a Python interface to the free corpus subset available at http://ruscorpora.ru.

Installation

pip install ruscorpora-tools

Usage

Corpus downloading

Download and unpack the archive with XML files from http://www.ruscorpora.ru/corpora-usage.html

Corpus reading

The ruscorpora.parse_xml function parses a single XML file and returns an iterator over sentences. Each sentence is a list of ruscorpora.Token instances, and each token is annotated with a list of ruscorpora.Annotation instances.

ruscorpora.simplify simplifies the output of ruscorpora.parse_xml by removing ambiguous annotations, joining split tokens (and their annotations), and removing accent information.

>>> import ruscorpora as rnc
>>> for sent in rnc.simplify(rnc.parse_xml('fiction.xml')):
...     print(sent)
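
Since parse_xml handles one file at a time and the unpacked archive contains several XML files, a small helper can chain them together. The following is a minimal sketch, not part of ruscorpora-tools: the name iter_corpus_sentences and its parse_file parameter are hypothetical, and the parser is passed in as a callable so the helper makes no assumptions about the library beyond "given a path, return an iterable of sentences".

```python
import os


def iter_corpus_sentences(corpus_dir, parse_file):
    """Yield sentences from every .xml file found under corpus_dir.

    parse_file is any callable that takes a file path and returns an
    iterable of sentences (hypothetical parameter, supplied by the caller).
    """
    for dirpath, dirnames, filenames in os.walk(corpus_dir):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            if name.lower().endswith('.xml'):
                path = os.path.join(dirpath, name)
                for sentence in parse_file(path):
                    yield sentence
```

With the package installed, it could be called as iter_corpus_sentences(corpus_dir, lambda path: rnc.simplify(rnc.parse_xml(path))), where corpus_dir is wherever the archive was unpacked.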

Working with tags

The ruscorpora.Tag class is a convenient wrapper for the tags used in ruscorpora:

>>> tag = rnc.Tag('S,f,inan=sg,nom')
>>> tag.POS
'S'
>>> tag.gender
'f'
>>> tag.animacy
'inan'
>>> tag.number
'sg'
>>> tag.case
'nom'
>>> print(tag.tense)
None

Other attributes are also available.

Check whether a grammeme is in a tag:

>>> 'S' in tag
True
>>> 'V' in tag
False
>>> 'Foo' in tag
Traceback (most recent call last):
...
ValueError: Grammeme is unknown: Foo

Test tag equality:

>>> tag == rnc.Tag('S,f,inan=sg,nom')
True
>>> tag == 'S,f,inan=sg,nom'
True
>>> tag == rnc.Tag('S,f,inan=sg,acc')
False
>>> tag == 'S,f,inan=sg,acc'
False
>>> tag == 'Foo,inan'
Traceback (most recent call last):
...
ValueError: Unknown grammemes: frozenset({Foo})

Tags returned by rnc.simplify are wrapped with this class by default.
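
To make the tag format more concrete, here is a rough standalone sketch of how a tag string such as 'S,f,inan=sg,nom' could be split into grammemes and compared. It is not the implementation of ruscorpora.Tag: the function names and the tiny KNOWN_GRAMMEMES inventory are hypothetical, and treating a tag as an unordered set of grammemes is an assumption made for illustration.

```python
import re

# Hypothetical, deliberately tiny grammeme inventory for illustration only;
# the tagset used by the corpus is much larger.
KNOWN_GRAMMEMES = frozenset([
    'S', 'V', 'A', 'f', 'm', 'n', 'anim', 'inan',
    'sg', 'pl', 'nom', 'gen', 'dat', 'acc', 'ins', 'loc',
])


def split_grammemes(tag_string):
    """Split a tag on ',' and '=' and return its grammemes as a frozenset."""
    grammemes = frozenset(g for g in re.split(r'[,=]', tag_string) if g)
    unknown = grammemes - KNOWN_GRAMMEMES
    if unknown:
        # Reject grammemes outside the inventory instead of ignoring them.
        raise ValueError('Unknown grammemes: %s' % ', '.join(sorted(unknown)))
    return grammemes


def same_tag(left, right):
    """Compare two tag strings as sets of grammemes (assumption: order is ignored)."""
    return split_grammemes(left) == split_grammemes(right)
```

Rejecting unknown grammemes up front, rather than returning False, means a typo in a tag string surfaces as an error instead of silently never matching.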

Development

Development happens on GitHub and Bitbucket.

The issue tracker is on GitHub: https://github.com/kmike/ruscorpora-tools/issues

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

Running tests

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under Python 2.6 through 3.3 and PyPy > 1.8.

