Higher-level text processing, built on spaCy
Project description
textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.
Features
Functions for preprocessing raw text prior to analysis (whitespace normalization, URL/email/number/date replacement, unicode fixing/stripping, etc.)
Convenient interface to basic linguistic elements provided by spaCy (words, ngrams, noun phrases, etc.), along with standardized filtering options
Variety of functions for extracting information from text (particular POS patterns, subject-verb-object triples, acronyms and their definitions, direct quotations, etc.)
Unsupervised key term extraction (specific algorithms such as SGRank or TextRank, as well as a general semantic network-based approach)
Conversion of individual documents into common representations (bag of words), as well as corpora (term-document matrix, with TF or TF-IDF weighting, and filtering by these metrics or IC)
Common utility functions for identifying a text’s language, displaying key words in context (KWIC), truecasing words, and higher-level navigation of a parse tree
And more!
Installation
The simplest way to install textacy is:
$ pip install -U textacy
Or, download and unzip the source tar.gz from PyPI, then:
$ python setup.py install
Example
>>> import textacy
>>>
>>> text = """
... Hell, it's about time someone told about my friend EPICAC. After all, he cost the taxpayers $776,434,927.54. They have a right to know about him, picking up a check like that. EPICAC got a big send off in the papers when Dr. Ormand von Kleigstadt designed him for the Government people. Since then, there hasn't been a peep about him -- not a peep. It isn't any military secret about what happened to EPICAC, although the Brass has been acting as though it were. The story is embarrassing, that's all. After all that money, EPICAC didn't work out the way he was supposed to.
... And that's another thing: I want to vindicate EPICAC. Maybe he didn't do what the Brass wanted him to, but that doesn't mean he wasn't noble and great and brilliant. He was all of those things. The best friend I ever had, God rest his soul.
... You can call him a machine if you want to. He looked like a machine, but he was a whole lot less like a machine than plenty of people I could name. That's why he fizzled as far as the Brass was concerned.
... """
>>> textacy.preprocess_text(text, lowercase=True, no_numbers=True, no_punct=True)
'hell its about time someone told about my friend epicac after all he cost the taxpayers number they have a right to know about him picking up a check like that epicac got a big send off in the papers when dr ormand von kleigstadt designed him for the government people since then there hasnt been a peep about him not a peep it isnt any military secret about what happened to epicac although the brass has been acting as though it were the story is embarrassing thats all after all that money epicac didnt work out the way he was supposed to\nand thats another thing i want to vindicate epicac maybe he didnt do what the brass wanted him to but that doesnt mean he wasnt noble and great and brilliant he was all of those things the best friend i ever had god rest his soul\nyou can call him a machine if you want to he looked like a machine but he was a whole lot less like a machine than plenty of people i could name thats why he fizzled as far as the brass was concerned'
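Under the hood, this kind of normalization amounts to a few regex passes. A rough, illustrative sketch of the idea — not textacy's actual implementation, which uses more careful rules (and ftfy/unidecode for unicode handling):

```python
import re

def preprocess_text(text, lowercase=True, no_numbers=True, no_punct=True):
    """Rough regex-based approximation of the normalization shown above."""
    if lowercase:
        text = text.lower()
    if no_numbers:
        # collapse number-like tokens (with $, commas, decimals) to a placeholder
        text = re.sub(r"\$?\d[\d,.]*", "number", text)
    if no_punct:
        text = re.sub(r"[^\w\s]", "", text)
    # collapse runs of spaces/tabs introduced by the removals
    return re.sub(r"[ \t]+", " ", text).strip()

print(preprocess_text("He cost the taxpayers $776,434,927.54!"))
# -> 'he cost the taxpayers number'
```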
>>> textacy.text_utils.keyword_in_context(text, 'EPICAC', window_width=40)
about time someone told about my friend EPICAC . After all, he cost the taxpayers $776,
bout him, picking up a check like that. EPICAC got a big send off in the papers when D
military secret about what happened to EPICAC , although the Brass has been acting as
sing, that's all. After all that money, EPICAC didn't work out the way he was supposed
at's another thing: I want to vindicate EPICAC . Maybe he didn't do what the Brass want
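The KWIC display above can be approximated in a few lines of plain Python. This sketch returns the formatted lines rather than printing them, and is not textacy's implementation:

```python
import re

def keyword_in_context(text, keyword, window_width=40):
    """Return each occurrence of `keyword` with `window_width` chars of context."""
    lines = []
    for match in re.finditer(re.escape(keyword), text):
        left = text[max(0, match.start() - window_width):match.start()].replace("\n", " ")
        right = text[match.end():match.end() + window_width].replace("\n", " ")
        # right-align the left context so the keyword forms a vertical column
        lines.append("{:>{w}} {} {}".format(left, keyword, right, w=window_width))
    return lines

text = "I want to vindicate EPICAC. Maybe he didn't do what the Brass wanted."
for line in keyword_in_context(text, "EPICAC", window_width=20):
    print(line)
```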
>>>
>>> doc = textacy.TextDoc(text.strip(), lang='auto',
... metadata={'title': 'EPICAC', 'author': 'Kurt Vonnegut'})
>>> print(doc)
TextDoc(230 tokens)
>>> doc.lang
'en'
>>>
>>> doc.ngrams(2, filter_stops=True, filter_punct=True)
[friend EPICAC.,
taxpayers $,
$776,434,927.54,
check like,
EPICAC got,
big send,
Dr. Ormand,
Ormand von,
von Kleigstadt,
Kleigstadt designed,
Government people,
military secret,
n't work,
vindicate EPICAC.,
EPICAC. Maybe,
Brass wanted,
n't mean,
n't noble,
best friend,
God rest,
looked like]
>>> doc.ngrams(3, filter_stops=True, filter_punct=True, min_freq=2)
[like a machine, like a machine]
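Conceptually, ngram extraction with a `min_freq` cutoff is just sliding a window over the token sequence and counting. A minimal sketch (textacy additionally applies the stopword/punctuation filters shown above, over a spaCy parse):

```python
from collections import Counter

def ngrams(tokens, n, min_freq=1):
    """All n-grams in document order, keeping only those seen >= min_freq times."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    return [g for g in grams if counts[g] >= min_freq]

toks = "he looked like a machine but he was less like a machine".split()
print(ngrams(toks, 3, min_freq=2))
# -> [('like', 'a', 'machine'), ('like', 'a', 'machine')]
```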
>>> doc.named_entities(drop_determiners=True, bad_ne_types='numeric')
[Hell, EPICAC, Ormand von Kleigstadt, EPICAC, EPICAC, Brass, God]
>>> doc.pos_regex_matches(r'<DET> <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN> <PART>?)+')
[the taxpayers,
a right to,
a check,
the papers,
the Government people,
a peep,
a peep,
any military secret,
the Brass,
The story,
that money,
the way he,
another thing,
the Brass,
those things,
The best friend I,
a machine,
a machine,
a whole lot,
a machine,
the Brass]
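The pattern syntax above can be implemented by flattening the document's POS tags into a single string, running an ordinary regex over it, and mapping match offsets back to token indices. A simplified sketch of that idea, assuming pre-tagged `(token, POS)` pairs rather than a spaCy parse (this is not textacy's actual code):

```python
import re

def pos_regex_matches(tagged, pattern):
    """tagged: [(token, POS), ...]; pattern: '<DET> <ADJ>* <NOUN>+' style."""
    tags = "".join("<{}>".format(pos) for _, pos in tagged)
    # spaces in the pattern are cosmetic; each <TAG> becomes a non-capturing group
    regex = re.sub(r"<([A-Z]+)>", r"(?:<\1>)", pattern.replace(" ", ""))
    phrases = []
    for m in re.finditer(regex, tags):
        i = tags[:m.start()].count("<")   # token index of match start
        j = tags[:m.end()].count("<")     # token index just past match end
        phrases.append(" ".join(tok for tok, _ in tagged[i:j]))
    return phrases

tagged = [("the", "DET"), ("best", "ADJ"), ("friend", "NOUN"),
          ("was", "VERB"), ("a", "DET"), ("machine", "NOUN")]
print(pos_regex_matches(tagged, "<DET> <ADJ>* <NOUN>+"))
# -> ['the best friend', 'a machine']
```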
>>> doc.semistructured_statements('he', cue='be')
[(he, was, n't noble and great and brilliant),
(He, was, all of those things),
(he, was, a whole lot less like a machine than plenty of people I could name)]
>>> doc.key_terms(algorithm='textrank', n=5)
[('EPICAC', 0.06369346448602185),
('Brass', 0.051763452142722675),
('machine', 0.04761999319651037),
('friend', 0.045713561400759786),
('people', 0.043303827328545416)]
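TextRank scores terms by running PageRank over a word co-occurrence graph. A bare-bones sketch using power iteration — textacy's version restricts candidates by POS and operates on the spaCy parse, so this is only the core idea:

```python
from collections import defaultdict

def textrank_keyterms(words, n=5, window=2, damping=0.85, iters=50):
    """PageRank by power iteration over a word co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if w != words[j]:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    nodes = list(neighbors)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        rank = {v: (1 - damping) / len(nodes)
                   + damping * sum(rank[u] / len(neighbors[u]) for u in neighbors[v])
                for v in nodes}
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:n]

# 'epicac', the hub of this tiny co-occurrence graph, ranks first
print(textrank_keyterms("epicac brass epicac friend epicac machine".split(),
                        n=2, window=1))
```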
>>> doc.readability_stats()
{'automated_readability_index': 5.848928571428573,
'coleman_liau_index': 9.577214607142864,
'flesch_kincaid_grade_level': 3.7476190476190503,
'flesch_readability_ease': 78.8807142857143,
'gunning_fog_index': 4.780952380952381,
'n_chars': 433,
'n_polysyllable_words': 5,
'n_sents': 14,
'n_syllables': 121,
'n_unique_words': 62,
'n_words': 84,
'smog_index': 6.5431188927421005}
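The grade-level scores are standard formulas over the raw counts in the same dict. For instance, the Flesch-Kincaid grade level and Flesch reading ease above can be recomputed directly from `n_words`, `n_sents`, and `n_syllables`:

```python
def flesch_kincaid_grade(n_words, n_sents, n_syllables):
    # standard Flesch-Kincaid grade-level formula
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59

def flesch_reading_ease(n_words, n_sents, n_syllables):
    # standard Flesch reading-ease formula
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)

# plugging in the counts from the stats dict above:
print(flesch_kincaid_grade(84, 14, 121))  # ~3.4762 + 0.27... => approx 3.7476
print(flesch_reading_ease(84, 14, 121))   # approx 78.8807
```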
>>> doc.term_count('EPICAC')
3
>>> bot = doc.as_bag_of_terms(weighting='tf', normalized=False,
... lemmatize='auto', ngram_range=(1, 2))
>>> [(doc.spacy_stringstore[term_id], count)
... for term_id, count in bot.most_common(n=10)]
[('not', 6),
("'", 4),
('EPICAC', 3),
('want', 3),
('Brass', 3),
('like', 3),
('machine', 3),
('\n', 2),
('people', 2),
('friend', 2)]
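With `weighting='tf'` the bag of terms is essentially an n-gram frequency count. A minimal sketch with `collections.Counter` — textacy stores integer IDs from spaCy's string store (hence the `spacy_stringstore` lookup above) rather than the strings themselves:

```python
from collections import Counter

def bag_of_terms(tokens, ngram_range=(1, 2)):
    """Term-frequency Counter over all n-grams in the given range."""
    bot = Counter()
    for n in range(ngram_range[0], ngram_range[1] + 1):
        for i in range(len(tokens) - n + 1):
            bot[" ".join(tokens[i:i + n])] += 1
    return bot

bot = bag_of_terms("he looked like a machine but he was less like a machine".split())
print(bot.most_common(3))
```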
Unofficial Roadmap
import/export for common formats
topic modeling via gensim and/or sklearn
distributional representations (word2vec etc.) via either gensim or spaCy
basic dictionary-based methods (sentiment analysis?)
text classification
media frames analysis
TODO
extract: return generators rather than lists?
texts: figure out what to do when documents are modified in-place (doc.merge)
texts: ^ related: when docs modified, erase cached_property attributes so they’ll be re-calculated
texts: ^related: update doc merge functions when Honnibal updates API
texts: what to do when new doc added to textcorpus does not have same language?
texts: have textdocs inherit _term_doc_freqs from textcorpus?
texts: add doc_to_bag_of_terms() func to transform?
transform: condense csc matrix by mapping stringstore term ints to incremented vals, starting at 0
drop scipy dependency and switch to Honnibal’s own sparse matrices
preprocess: add basic tests for unidecode and ftfy functions
File details
Details for the file textacy-0.1.1.tar.gz
File metadata
- Download URL: textacy-0.1.1.tar.gz
- Upload date:
- Size: 41.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | c96bcd5b431387c51c329499b19de506cddbe5900f6558b9ab79c449c63cf0cd
MD5 | 3a98782add8706540ed5cf0403eeed15
BLAKE2b-256 | ab235b73ba2d5b9a9e5a39fadbc9d0c6bbd9c76e87d34605a0fa9ee399b9dd0d
File details
Details for the file textacy-0.1.1-py2.py3-none-any.whl
File metadata
- Download URL: textacy-0.1.1-py2.py3-none-any.whl
- Upload date:
- Size: 141.3 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | df4e385cfe4632ad8e6c3b36e34f461a9358630d186a3a73184e6f4bd61af548
MD5 | e916f5730b212b69d9b6e1e4f1712610
BLAKE2b-256 | 850b8832664f9f1ae20bd02f2965f4a9e2188df43dddab2a2a089e75b8775eba