oakx-spacy
Spacy + SciSpacy plugin for OAK.
ALPHA
Usage
Non-developers:
Create a virtual environment with your preferred tool (conda, poetry, venv, etc.), then install oakx-spacy using pip install.
pip install oakx-spacy
Next, download and install the desired models (Spacy and/or SciSpacy). The available models are listed below.
Spacy models
English pipelines optimized for CPU.
To install any of the models below, run python -m spacy download en_core_web_xxx, replacing en_core_web_xxx with the model name.
- en_core_web_sm: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
- en_core_web_md: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
- en_core_web_lg: Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
- en_core_web_trf: Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer.
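Since pip-installed spaCy models are ordinary Python packages, one way to see which of the models above are already present is to look them up by package name. The following is a minimal sketch; the helper names installed_models and download_command are illustrative and not part of oakx-spacy or spaCy.

```python
from importlib.util import find_spec

# Model package names taken from the list above.
SPACY_MODELS = [
    "en_core_web_sm",
    "en_core_web_md",
    "en_core_web_lg",
    "en_core_web_trf",
]


def installed_models(candidates):
    """Return the candidates that are importable in the current environment."""
    return [name for name in candidates if find_spec(name) is not None]


def download_command(model):
    """Build the spaCy download command for one model as an argument list."""
    return ["python", "-m", "spacy", "download", model]


if __name__ == "__main__":
    present = installed_models(SPACY_MODELS)
    for name in SPACY_MODELS:
        if name in present:
            print(f"{name}: installed")
        else:
            print(f"{name}: missing, run: {' '.join(download_command(name))}")
```

Developers using poetry would prefix the printed command with poetry run.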
SciSpacy models
To install any of the models below, pass the model's URL from the corresponding line in pyproject.toml to pip install.
For example, to install the model trained on the CRAFT corpus, run the following:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz
Available models:
- en_ner_craft_md: A spaCy NER model trained on the CRAFT corpus.
- en_ner_jnlpba_md: A spaCy NER model trained on the JNLPBA corpus.
- en_ner_bc5cdr_md: A spaCy NER model trained on the BC5CDR corpus.
- en_ner_bionlp13cg_md: A spaCy NER model trained on the BIONLP13CG corpus.
- en_core_sci_scibert: A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model.
- en_core_sci_sm: A full spaCy pipeline for biomedical data.
- en_core_sci_md: A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.
- en_core_sci_lg: A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.
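The CRAFT example URL follows a simple pattern of release version plus model name. The sketch below builds such a URL for any model name; it assumes the other models are published under the same pattern and version, so check the result against the matching line in pyproject.toml before installing. The function name scispacy_model_url is hypothetical.

```python
# Base URL copied from the CRAFT example above.
SCISPACY_RELEASES = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"


def scispacy_model_url(model, version="0.5.1"):
    """Build a download URL for a SciSpacy model.

    Assumption: every model follows the same layout as the CRAFT example,
    i.e. <base>/v<version>/<model>-<version>.tar.gz
    """
    return f"{SCISPACY_RELEASES}/v{version}/{model}-{version}.tar.gz"


if __name__ == "__main__":
    # Print a pip command for the CRAFT model.
    print("pip install " + scispacy_model_url("en_ner_craft_md"))
```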
SciSpacy linkers
These come preinstalled with the scispacy package itself. Available linkers are:
- umls: Links to the Unified Medical Language System, levels 0,1,2 and 9. This has ~3M concepts.
- mesh: Links to the Medical Subject Headings. This contains a smaller set of higher quality entities, which are used for indexing in Pubmed. MeSH contains ~30k entities. NOTE: The MeSH KB is derived directly from MeSH itself, and as such uses different unique identifiers than the other KBs.
- rxnorm: Links to the RxNorm ontology. RxNorm contains ~100k concepts focused on normalized names for clinical drugs. It is comprised of several other drug vocabularies commonly used in pharmacy management and drug interaction, including First Databank, Micromedex, and the Gold Standard Drug Database.
- go: Links to the Gene Ontology. The Gene Ontology contains ~67k concepts focused on the functions of genes.
- hpo: Links to the Human Phenotype Ontology. The Human Phenotype Ontology contains 16k concepts focused on phenotypic abnormalities encountered in human disease.
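For orientation, this is roughly how a linker is attached to a pipeline when using scispacy directly, following the pattern in scispacy's own documentation. It is a sketch only: how oakx-spacy wires the linker internally is not described here, and the helper names linker_config and build_pipeline are hypothetical. Loading a linker can take a long time and a lot of memory, so build_pipeline is defined but not called.

```python
# Linker names taken from the list above.
SCISPACY_LINKERS = ("umls", "mesh", "rxnorm", "go", "hpo")


def linker_config(linker_name="umls", resolve_abbreviations=False):
    """Return a config dict for scispacy's "scispacy_linker" component."""
    if linker_name not in SCISPACY_LINKERS:
        raise ValueError(
            f"unknown linker {linker_name!r}; expected one of {SCISPACY_LINKERS}"
        )
    return {
        "resolve_abbreviations": resolve_abbreviations,
        "linker_name": linker_name,
    }


def build_pipeline(model="en_ner_craft_md", linker_name="umls"):
    """Load a model and add the chosen linker (requires spacy and scispacy)."""
    import spacy
    # Importing EntityLinker registers the "scispacy_linker" factory with spaCy.
    from scispacy.linking import EntityLinker  # noqa: F401

    nlp = spacy.load(model)
    nlp.add_pipe("scispacy_linker", config=linker_config(linker_name))
    return nlp
```

Validating the name up front gives a clear error for a typo before any model or knowledge base is loaded.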
Developers:
Clone the repository
git clone https://github.com/hrshdhgd/oakx-spacy.git
Install poetry
pip install poetry
SciSpacy models
In pyproject.toml, uncomment the two lines corresponding to each desired model. For example, for the model trained on the CRAFT corpus, uncomment the following:
[tool.poetry.dependencies.en_ner_craft_md]
url = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_craft_md-0.5.1.tar.gz"
Install dependencies
poetry install
Spacy models
Follow the same instructions as for non-developers, but prefix each command with poetry run.
The default model is en_ner_craft_md and the default linker is umls.
How it works
There are two possible inputs to this plugin:
- A .txt file [runoak -i spacy: annotate --text-file text.txt]
- Words that need to be annotated [runoak -i spacy: annotate Myeloid derived suppressor cells (MDSC) are immature myeloid cells with immunosuppressive activity.]
To use a different combination of model and linker, edit the config.yaml file and add -c config.yaml at the end of either command above.
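When calling the plugin from a script, the two input modes and the optional config file can be assembled into a single argument list. The sketch below is illustrative: annotate_command and run_annotate are hypothetical helpers, it assumes the annotate subcommand is used for inline text as well as for text files, and it uses only the flags shown above.

```python
import subprocess


def annotate_command(text=None, text_file=None, config=None):
    """Build a runoak argument list for exactly one kind of input."""
    if (text is None) == (text_file is None):
        raise ValueError("provide exactly one of text or text_file")
    cmd = ["runoak", "-i", "spacy:", "annotate"]
    if text_file is not None:
        cmd += ["--text-file", str(text_file)]
    else:
        # Pass the words as separate arguments, as an unquoted shell call would.
        cmd += text.split()
    if config is not None:
        # The config option goes at the end of the command.
        cmd += ["-c", str(config)]
    return cmd


def run_annotate(**kwargs):
    """Run the command and return its standard output (requires runoak)."""
    result = subprocess.run(
        annotate_command(**kwargs), check=True, capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Show the command lines without executing them.
    print(" ".join(annotate_command(text_file="text.txt")))
    print(" ".join(annotate_command(text="Myeloid derived suppressor cells",
                                    config="config.yaml")))
```

Building the command as a list rather than a single string avoids shell quoting problems with parentheses and punctuation in the input text.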
Acknowledgements
This cookiecutter project was developed from the oakx-plugin-cookiecutter template and will be kept up-to-date using cruft.