A tool for learning embeddings of words and entities from Wikipedia

Introduction

Wikipedia2Vec is a tool for obtaining high-quality embeddings (vector representations) of words and Wikipedia entities from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings that map words and entities into a unified continuous vector space. The embeddings can be used as word embeddings, entity embeddings, or unified embeddings of words and entities. They have been used in state-of-the-art models for various tasks such as entity linking, named entity recognition, entity relatedness, and question answering.

The embeddings can be easily trained from a publicly available Wikipedia dump. The code is implemented in Python, and optimized using Cython and BLAS.

How It Works

Wikipedia2Vec is based on Word2vec's skip-gram model, which learns to predict neighboring words given each word in a corpus. We extend the skip-gram model by adding the following two sub-models:

  • The KB link graph model, which learns to estimate neighboring entities given an entity in the link graph of Wikipedia entities.

  • The anchor context model, which learns to predict neighboring words given an entity, based on anchor links pointing to the entity and their surrounding words.

By jointly optimizing the skip-gram model and these two sub-models, our model simultaneously learns the embeddings of words and entities from Wikipedia. For further details, please refer to our paper: Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.
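
Informally, the joint training objective can be sketched as the sum of three skip-gram-style losses (the notation below is a simplified illustration, not the exact formulation in the paper):

L = L_word + L_link + L_anchor

where L_word is the standard skip-gram loss over words in the Wikipedia text, L_link is the loss for predicting neighboring entities in the link graph, and L_anchor is the loss for predicting the words surrounding each anchor link given the linked entity. All three terms are optimized jointly with negative sampling.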

Pretrained Embeddings

(coming soon)

Installation

If you want to train embeddings on your machine, we highly recommend installing a BLAS library before installing this tool. We recommend OpenBLAS or the Intel Math Kernel Library.

Wikipedia2Vec can be installed from PyPI:

% pip install Wikipedia2Vec

To process Japanese Wikipedia dumps, you also need to install MeCab and its Python binding.
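
For example, assuming MeCab itself is already installed on your system, the mecab-python3 package is a commonly used Python binding that can be installed with:

% pip install mecab-python3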

Learning Embeddings

First, you need to download a source Wikipedia dump file (e.g., enwiki-latest-pages-articles.xml.bz2) from Wikimedia Downloads. The English dump file can be obtained by running:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Note that you do not need to decompress the dump file.

Then, the embeddings can be trained from a Wikipedia dump using the train command:

% wikipedia2vec train DUMP_FILE OUT_FILE

Arguments:

  • DUMP_FILE: The Wikipedia dump file

  • OUT_FILE: The output file

Options:

  • --dim-size: The number of dimensions of the embeddings (default: 100)

  • --window: The maximum distance between the target item (word or entity) and the context word to be predicted (default: 5)

  • --iteration: The number of iterations for Wikipedia pages (default: 3)

  • --negative: The number of negative samples (default: 5)

  • --lowercase/--no-lowercase: Whether to lowercase words and phrases (default: True)

  • --min-word-count: A word is ignored if the total frequency of the word is lower than this value (default: 10)

  • --min-entity-count: An entity is ignored if the total frequency of the entity appearing as the referent of an anchor link is lower than this value (default: 5)

  • --link-graph/--no-link-graph: Whether to learn from the Wikipedia link graph (default: True)

  • --links-per-page: The number of contextual entities to be generated from the link graph for processing each page (default: 10)

  • --phrase/--no-phrase: Whether to learn the embeddings of phrases (default: True)

  • --min-link-count: A phrase is ignored if the total frequency of the phrase appearing as an anchor link is lower than this value (default: 10)

  • --min-link-prob: A phrase is ignored if the probability of the phrase appearing as an anchor link is lower than this value (default: 0.1)

  • --max-phrase-len: The maximum number of words in a phrase (default: 4)

  • --init-alpha: The initial learning rate (default: 0.025)

  • --min-alpha: The minimum learning rate (default: 0.0001)

  • --sample: The parameter that controls downsampling of high frequency words (default: 1e-4)
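
As an illustrative example, the following invocation trains 300-dimensional embeddings with a larger context window (the file names are placeholders):

% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 enwiki_300d_model --dim-size 300 --window 10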

The train command internally calls the four commands described below (i.e., build_phrase_dictionary, build_dictionary, build_link_graph, and train_embedding).

Building Phrase Dictionary

The build_phrase_dictionary command constructs a dictionary consisting of phrases extracted from Wikipedia. We extract all phrases that appear as anchor links in Wikipedia and filter them using three thresholds: min_link_count, min_link_prob, and max_phrase_len. Detected phrases are treated as words in the subsequent steps.

% wikipedia2vec build_phrase_dictionary DUMP_FILE OUT_FILE

Arguments:

  • DUMP_FILE: The Wikipedia dump file

  • OUT_FILE: The output file

Options:

  • --lowercase/--no-lowercase: Whether to lowercase phrases (default: True)

  • --min-link-count: A phrase is ignored if the total frequency of the phrase appearing as an anchor link is lower than this value (default: 10)

  • --min-link-prob: A phrase is ignored if the probability of the phrase appearing as an anchor link is lower than this value (default: 0.1)

  • --max-phrase-len: The maximum number of words in a phrase (default: 4)

Building Dictionary

The build_dictionary command builds a dictionary of words and entities.

% wikipedia2vec build_dictionary DUMP_FILE OUT_FILE

Arguments:

  • DUMP_FILE: The Wikipedia dump file

  • OUT_FILE: The output file

Options:

  • --phrase: The phrase dictionary file generated using the build_phrase_dictionary command

  • --lowercase/--no-lowercase: Whether to lowercase words (default: True)

  • --min-word-count: A word is ignored if the total frequency of the word is lower than this value (default: 10)

  • --min-entity-count: An entity is ignored if the total frequency of the entity appearing as the referent of an anchor link is lower than this value (default: 5)
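
For example, to build a dictionary that incorporates the phrases detected in the previous step (the file names are placeholders):

% wikipedia2vec build_phrase_dictionary enwiki-latest-pages-articles.xml.bz2 phrase_dic_file
% wikipedia2vec build_dictionary enwiki-latest-pages-articles.xml.bz2 dic_file --phrase phrase_dic_file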

Learning Embeddings

The train_embedding command trains the embeddings.

% wikipedia2vec train_embedding DUMP_FILE DIC_FILE OUT_FILE

Arguments:

  • DUMP_FILE: The Wikipedia dump file

  • DIC_FILE: The dictionary file generated by the build_dictionary command

  • OUT_FILE: The output file

Options:

  • --link-graph: The link graph file generated using the build_link_graph command

  • --dim-size: The number of dimensions of the embeddings (default: 100)

  • --window: The maximum distance between the target item (word or entity) and the context word to be predicted (default: 5)

  • --iteration: The number of iterations for Wikipedia pages (default: 3)

  • --negative: The number of negative samples (default: 5)

  • --links-per-page: The number of contextual entities to be generated from the link graph for processing each page (default: 10)

  • --init-alpha: The initial learning rate (default: 0.025)

  • --min-alpha: The minimum learning rate (default: 0.0001)

  • --sample: The parameter that controls downsampling of high frequency words (default: 1e-4)
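
For example, assuming a link graph file has already been built with the build_link_graph command, the following illustrative invocation trains the embeddings using both the dictionary and the link graph (the file names are placeholders):

% wikipedia2vec train_embedding enwiki-latest-pages-articles.xml.bz2 dic_file model_file --link-graph link_graph_file --dim-size 300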

Saving Embeddings in Text Format

The save_text command outputs a model in text format.

% wikipedia2vec save_text MODEL_FILE OUT_FILE

Arguments:

  • MODEL_FILE: The model file generated by the train_embedding command

  • OUT_FILE: The output file

This command has no options.

Sample Usage

>>> from wikipedia2vec import Wikipedia2Vec

>>> wiki2vec = Wikipedia2Vec.load(MODEL_FILE)

>>> wiki2vec.get_word_vector(u'the')
memmap([ 0.01617998, -0.03325786, -0.01397999, -0.00150471,  0.03237337,
...
       -0.04226106, -0.19677088, -0.31087297,  0.1071524 , -0.09824426], dtype=float32)

>>> wiki2vec.get_entity_vector(u'Scarlett Johansson')
memmap([-0.19793572,  0.30861306,  0.29620451, -0.01193621,  0.18228433,
...
        0.04986198,  0.24383858, -0.01466644,  0.10835337, -0.0697331 ], dtype=float32)

>>> wiki2vec.most_similar(wiki2vec.get_word(u'yoda'), 5)
[(<Word yoda>, 1.0),
 (<Entity Yoda>, 0.84333622),
 (<Word darth>, 0.73328167),
 (<Word kenobi>, 0.7328127),
 (<Word jedi>, 0.7223742)]

>>> wiki2vec.most_similar(wiki2vec.get_entity(u'Scarlett Johansson'), 5)
[(<Entity Scarlett Johansson>, 1.0),
 (<Entity Natalie Portman>, 0.75090045),
 (<Entity Eva Mendes>, 0.73651594),
 (<Entity Emma Stone>, 0.72868186),
 (<Entity Cameron Diaz>, 0.72390842)]

Reference

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

@InProceedings{yamada-EtAl:2016:CoNLL,
  author    = {Yamada, Ikuya  and  Shindo, Hiroyuki  and  Takeda, Hideaki  and  Takefuji, Yoshiyasu},
  title     = {Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  booktitle = {Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  pages     = {250--259},
  publisher = {Association for Computational Linguistics}
}

License

Apache License 2.0
