A tool for learning vector representations of words and entities from Wikipedia

Wikipedia2Vec

Wikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.
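
To make "close to one another" concrete, the following minimal sketch ranks items by cosine similarity. The three-dimensional vectors and the `ENTITY/` key prefix are invented for illustration and are not taken from any trained model; real embeddings have many more dimensions.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Keys starting with "ENTITY/" stand for Wikipedia pages, the rest for words.
EMBEDDINGS = {
    "guitar": [0.9, 0.1, 0.0],
    "ENTITY/Jimi Hendrix": [0.8, 0.2, 0.1],
    "ENTITY/Tokyo": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(key, embeddings):
    """Return the other key whose vector is most similar to that of `key`."""
    query = embeddings[key]
    others = (k for k in embeddings if k != key)
    return max(others, key=lambda k: cosine(query, embeddings[k]))


print(nearest("guitar", EMBEDDINGS))  # ENTITY/Jimi Hendrix
```

Because words and entities share one vector space, the same similarity function can compare a word with an entity directly.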

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.
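
As a rough illustration of what a skip-gram model trains on, the sketch below extracts (target, context) pairs from a token sequence. The second function is loosely modelled on the idea of pairing a linked entity with the words around its link; it is a simplification under stated assumptions, not the tool's actual sampling or objective. The function names, the `window` parameter, and the `ENTITY/` prefix are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for each token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def anchor_pairs(tokens, anchor_index, entity, window=2):
    """Return (entity, context_word) pairs around the token that links to `entity`.

    Hypothetical helper: it treats tokens[anchor_index] as a link whose
    target page is `entity` and pairs that entity with nearby words.
    """
    lo = max(0, anchor_index - window)
    hi = min(len(tokens), anchor_index + window + 1)
    return [(entity, tokens[j]) for j in range(lo, hi) if j != anchor_index]


sentence = ["hendrix", "played", "guitar"]
print(skipgram_pairs(sentence, window=1))
print(anchor_pairs(sentence, 0, "ENTITY/Jimi Hendrix", window=1))
```

A skip-gram model is then trained so that each target predicts its context items well, which is what pulls related words and entities together.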

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running a train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
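
Once a model has been written, it can be read back from Python. The sketch below assumes the Python API described in the project documentation (`Wikipedia2Vec.load`, `get_word_vector`, `get_entity_vector`); the helper `describe_model` and its default arguments are hypothetical, and the example word and entity must exist in the trained model.

```python
import os


def describe_model(model_path, word="the", entity="Scarlett Johansson"):
    """Load a trained model and return the shapes of one word and one entity vector.

    Returns None when `model_path` does not exist, so the sketch can be run
    before any model has been trained.
    """
    if not os.path.exists(model_path):
        return None

    # Imported lazily so the missing-file branch works without the package.
    from wikipedia2vec import Wikipedia2Vec

    model = Wikipedia2Vec.load(model_path)
    word_vector = model.get_word_vector(word)
    entity_vector = model.get_entity_vector(entity)  # entity is a page title
    return word_vector.shape, entity_vector.shape


print(describe_model("MODEL_FILE"))
```

Both vectors should have the same dimensionality, since words and entities are embedded in the same space.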

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.

Use Cases

Wikipedia2Vec has been applied to the following tasks:

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

@inproceedings{yamada2020wikipedia2vec,
  title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  pages = {23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of The 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages = {563--573}
}

License

Apache License 2.0
