A tool for learning vector representations of words and entities from Wikipedia

Project description

Wikipedia2Vec

Wikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia. It is developed and maintained by Studio Ousia.

This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space. Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.
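As a rough illustration of what "close to one another" means, the sketch below ranks a few toy vectors by cosine similarity. The three-dimensional vectors and the `word:`/`entity:` key names are made up for this example and are not taken from a trained model; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: words and entities share one vector space.
toy_embeddings = {
    "word:tokyo": [0.9, 0.1, 0.0],
    "entity:Tokyo": [0.8, 0.2, 0.1],
    "word:banana": [0.0, 0.1, 0.9],
}

def most_similar(query, embeddings, top_n=2):
    """Return the top_n (name, similarity) pairs closest to the query item."""
    scores = [
        (name, cosine_similarity(embeddings[query], vector))
        for name, vector in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]

for name, score in most_similar("word:tokyo", toy_embeddings):
    print(f"{name}\t{score:.3f}")
```

Because words and entities live in the same space, a word can be compared directly with an entity, which is what makes the joint embeddings useful for tasks such as entity linking.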

This tool implements the conventional skip-gram model to learn the embeddings of words, and its extension proposed in Yamada et al. (2016) to learn the embeddings of entities.
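For readers unfamiliar with skip-gram, the following minimal sketch shows how (target, context) training pairs are typically generated from a token sequence. It is a generic illustration, not the tool's actual implementation; the function name `skipgram_pairs` and the `window` parameter are assumptions made for this example, and the entity extension of Yamada et al. (2016) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs for every token within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical sentence used only for illustration.
sentence = ["tokyo", "is", "the", "capital", "of", "japan"]
for target, context in skipgram_pairs(sentence, window=1):
    print(target, "->", context)
```

A skip-gram model is then trained so that each target's vector is predictive of its context items, which is what pulls items with similar contexts together.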

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available here.

Documentation is available online at http://wikipedia2vec.github.io/.

Basic Usage

Wikipedia2Vec can be installed via PyPI:

% pip install wikipedia2vec

With this tool, embeddings can be learned by running a train command with a Wikipedia dump as input. For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE

Then, the learned embeddings are written to MODEL_FILE. Note that this command can take many optional parameters. Please refer to our documentation for further details.
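Once training finishes, the model can be queried from Python. The sketch below follows the Python API described in the project's documentation (`Wikipedia2Vec.load`, `get_word_vector`, `get_entity_vector`, `get_entity`, `most_similar`); treat it as an outline and check the documentation for the version you have installed. The helper name `describe_model` and the default word and entity title are illustrative choices, not part of the library.

```python
def describe_model(model_file, word="the", entity="Scarlett Johansson", top_n=5):
    """Load a trained model and return a few example lookups from it."""
    # Imported lazily so this sketch can be defined without the package installed.
    from wikipedia2vec import Wikipedia2Vec

    wiki2vec = Wikipedia2Vec.load(model_file)

    # Vector for a word; the lookup fails if the word is not in the vocabulary.
    word_vector = wiki2vec.get_word_vector(word)

    # Vector for an entity, addressed by its Wikipedia page title.
    entity_vector = wiki2vec.get_entity_vector(entity)

    # Nearest neighbours (words and entities) of the entity.
    neighbours = wiki2vec.most_similar(wiki2vec.get_entity(entity), top_n)
    return word_vector, entity_vector, neighbours

# Usage, assuming MODEL_FILE was produced by the train command above:
# word_vec, entity_vec, neighbours = describe_model("MODEL_FILE")
```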

Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from this page.

Use Cases

Wikipedia2Vec has been applied to the following tasks:

References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia.

@inproceedings{yamada2020wikipedia2vec,
  title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
  year = {2020},
  publisher = {Association for Computational Linguistics},
  pages = {23--30}
}

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation.

@inproceedings{yamada2016joint,
  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
  year={2016},
  publisher={Association for Computational Linguistics},
  pages={250--259}
}

The text classification model implemented in this example was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Neural Attentive Bag-of-Entities Model for Text Classification.

@inproceedings{yamada2019neural,
  title={Neural Attentive Bag-of-Entities Model for Text Classification},
  author={Yamada, Ikuya and Shindo, Hiroyuki},
  booktitle={Proceedings of The 23rd SIGNLL Conference on Computational Natural Language Learning},
  year={2019},
  publisher={Association for Computational Linguistics},
  pages={563--573}
}

License

Apache License 2.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wikipedia2vec-2.0.0.tar.gz (970.0 kB view details)

Uploaded Source

Built Distributions

wikipedia2vec-2.0.0-cp312-cp312-win_amd64.whl (1.5 MB view details)

Uploaded CPython 3.12 Windows x86-64

wikipedia2vec-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB view details)

Uploaded CPython 3.12 manylinux: glibc 2.17+ x86-64

wikipedia2vec-2.0.0-cp312-cp312-macosx_11_0_arm64.whl (1.6 MB view details)

Uploaded CPython 3.12 macOS 11.0+ ARM64

wikipedia2vec-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.12 macOS 10.9+ x86-64

wikipedia2vec-2.0.0-cp311-cp311-win_amd64.whl (1.5 MB view details)

Uploaded CPython 3.11 Windows x86-64

wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB view details)

Uploaded CPython 3.11 manylinux: glibc 2.17+ x86-64

wikipedia2vec-2.0.0-cp311-cp311-macosx_11_0_arm64.whl (1.6 MB view details)

Uploaded CPython 3.11 macOS 11.0+ ARM64

wikipedia2vec-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.11 macOS 10.9+ x86-64

wikipedia2vec-2.0.0-cp310-cp310-win_amd64.whl (1.5 MB view details)

Uploaded CPython 3.10 Windows x86-64

wikipedia2vec-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB view details)

Uploaded CPython 3.10 manylinux: glibc 2.17+ x86-64

wikipedia2vec-2.0.0-cp310-cp310-macosx_11_0_arm64.whl (1.6 MB view details)

Uploaded CPython 3.10 macOS 11.0+ ARM64

wikipedia2vec-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.10 macOS 10.9+ x86-64

wikipedia2vec-2.0.0-cp39-cp39-win_amd64.whl (1.5 MB view details)

Uploaded CPython 3.9 Windows x86-64

wikipedia2vec-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB view details)

Uploaded CPython 3.9 manylinux: glibc 2.17+ x86-64

wikipedia2vec-2.0.0-cp39-cp39-macosx_11_0_arm64.whl (1.6 MB view details)

Uploaded CPython 3.9 macOS 11.0+ ARM64

wikipedia2vec-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.9 macOS 10.9+ x86-64

wikipedia2vec-2.0.0-cp38-cp38-win_amd64.whl (1.6 MB view details)

Uploaded CPython 3.8 Windows x86-64

wikipedia2vec-2.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB view details)

Uploaded CPython 3.8 manylinux: glibc 2.17+ x86-64

wikipedia2vec-2.0.0-cp38-cp38-macosx_11_0_arm64.whl (1.6 MB view details)

Uploaded CPython 3.8 macOS 11.0+ ARM64

wikipedia2vec-2.0.0-cp38-cp38-macosx_10_9_x86_64.whl (1.7 MB view details)

Uploaded CPython 3.8 macOS 10.9+ x86-64

File details

Details for the file wikipedia2vec-2.0.0.tar.gz.

File metadata

  • Download URL: wikipedia2vec-2.0.0.tar.gz
  • Upload date:
  • Size: 970.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.2 CPython/3.11.7

File hashes

Hashes for wikipedia2vec-2.0.0.tar.gz
Algorithm Hash digest
SHA256 191b9a80fb16653315385fc5ff4b26c92bf02dca5ffcd038d184ed6e61b6c350
MD5 780d2c582120aa557b0d81c977b951d2
BLAKE2b-256 cb1c07887bf23c3fa3dfc01bf0b6e3c02048a02f5d1c29e75b5569f1bd92e78a

See more details on using hashes here.
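A downloaded file can be checked against its listed SHA256 digest with the Python standard library. This is a minimal sketch; the helper names `sha256_of_file` and `verify_sha256` are made up for this example, and the local file path in the usage comment is an assumption about where the file was saved.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_sha256(path, expected_hex):
    """Return True if the file's digest matches the expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# Usage, assuming the source distribution was saved in the current directory:
# ok = verify_sha256(
#     "wikipedia2vec-2.0.0.tar.gz",
#     "191b9a80fb16653315385fc5ff4b26c92bf02dca5ffcd038d184ed6e61b6c350",
# )
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for these packages but is a sensible default.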

File details

Details for the file wikipedia2vec-2.0.0-cp312-cp312-win_amd64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 ddcc94fe9fa1132106f5c08da417e724400f38ace5f8d657352e6fd2831ab580
MD5 41a91e24bde4e4d01c82b24e7af5e96f
BLAKE2b-256 205eda15e9166f44c452fb1f3142b87cd20a9dc3c51d1c38482c44a1eb3454a6

File details

Details for the file wikipedia2vec-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4cd6b33d84fc7faaec117e1312869b07906ab0fd03de82bf51e93aa81f5efa5b
MD5 6c3a258ef1d37e7f83a1b32767a6bd1e
BLAKE2b-256 4e7f71ceffdfb1e26a1302d56ee9e4568022df70aed93b7cabdd87157b7971eb

File details

Details for the file wikipedia2vec-2.0.0-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 907a7912c0a982fe96c97918b46f3a22562443b236b666c03f88be2a1e7c6900
MD5 7e4a2636f1f2e00ee476a427451c3e54
BLAKE2b-256 01c2d3621f0ca49d89e79b67855431ac722678389df55c6e0b4b03bf129a6a21

File details

Details for the file wikipedia2vec-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp312-cp312-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ff717fff738c4aab92c640f443515505816fcf4767ecb343a8e61fc1aaca66c2
MD5 b084a1d031ebb9f935070307e4d50a9a
BLAKE2b-256 e6ad7b9d54c2ed5647b7b15e0781100a582ee6a6f3c8a04223d27550bd0c35bb

File details

Details for the file wikipedia2vec-2.0.0-cp311-cp311-win_amd64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 e17b1db1b858455753febb05334b65c2676dbcea1f4e66e7b1adaef5c4d52e6e
MD5 e937b5adabaccfb94e624d8c094cb7b1
BLAKE2b-256 9f7719d09f5b543ed3cacf0d7ded0cafdee61baf0f610ff5b3c296043cba86c0

File details

Details for the file wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f554cefde337725866de9888bc567e8a1bc2b024f0b1f025ce9d8273bd229cff
MD5 e28a5e2f700c20129bf423e36eaaef99
BLAKE2b-256 8f4341c122f4b2a67d94fa154da796c8f9d41dc189e398637210e56802587d01

File details

Details for the file wikipedia2vec-2.0.0-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 dc327989c3ac31b864468cb3882e2ddc6abb850efb2ae5605a316111182b144d
MD5 e3f1aedf94b356348d38a3207585cfe7
BLAKE2b-256 5a7ac40eebd9dad0230a16539cb7fd84779c19d39d207473c412b408074e3fcd

File details

Details for the file wikipedia2vec-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 3ec0da92623093e0ccb59249cb70fdbed9d679462135a618bff7e39f235d2fdb
MD5 dfcfe83f2d9e76b8c4e2175e653600a8
BLAKE2b-256 f967b011227824a7210760eaab8e4dc16bd8aeb5797db29ab5cb27052b8a7204

File details

Details for the file wikipedia2vec-2.0.0-cp310-cp310-win_amd64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 eda8c27a31538bf471a6ef44e9e351f5b4eba3852fb1dff49b524f0fdbd678c2
MD5 cff6f982a0a29bb3c8a43204b299e904
BLAKE2b-256 386ffc4cb6472d569b0349644d0776d5b638b2a9fa39e24ae6281932d51338bc

File details

Details for the file wikipedia2vec-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 1cb267258ec8ea59e68c0625a613700eea4a4a99ea77948f654005eca72d9441
MD5 7bfbf15d305d0b6dc67fe62d89a9a482
BLAKE2b-256 4f95583ff157469e665b892126d6168a6df7442c987a22049036910415211d67

File details

Details for the file wikipedia2vec-2.0.0-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a2b70318e8e9ed7bf44fb321b272692de5e875314234d19e51d348133c74a97e
MD5 afc729cfb3001aaff067f309de48e747
BLAKE2b-256 fd0a04cdd6eb5c088fb4ceb5a2c0251f7ee659ed5bb2a8cb2ab38408f13c6e89

File details

Details for the file wikipedia2vec-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 df3c47db08266fbc73112f7600ebd8ac4be6d2e474d632967591a666a5d3102a
MD5 e898dd4bf26f4e7ae77a60fcb29cf8ed
BLAKE2b-256 1d4ff1b94048afd56d643e2b684162d8f9e6ebeedbad964bea720578f656cbdd

File details

Details for the file wikipedia2vec-2.0.0-cp39-cp39-win_amd64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 0ea17fd852158765c6328156f86d87507f9fac52b6a9121a99834d5c7aac13e1
MD5 30c7db1e7542744770455452d774d4c0
BLAKE2b-256 7d45e34d41f420d78c15d8d0c7ca8520664fc42aa229451c86a1ea1806fe1edb

File details

Details for the file wikipedia2vec-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 0e874978e0333788b0a92cb2039eabd629d01f1a912fd9e2e8872adc7ee8d22c
MD5 fda70e0afc10d6f6747a6ae3fdbfd522
BLAKE2b-256 f6a38124139e51b441682a4182173af7e77e739b60a30de709cffa34e9fdb5ab

File details

Details for the file wikipedia2vec-2.0.0-cp39-cp39-macosx_11_0_arm64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp39-cp39-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8f0658b2704c4c0b3c107d00b9e65b136e16bc27a835d10e5a90c69efe594a32
MD5 dc1277301e149e5bcad59d74ce253987
BLAKE2b-256 a16349fb110fc83ae8f5127ae5dbd2f19268a34aaa1f9458a77d20b964fedaac

File details

Details for the file wikipedia2vec-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 3f75bbfa99692215545f952dc71c8ede42c23def895d10d4b0cd3b686e74d6a7
MD5 ca3e88efca2e863e2784feb6db0b3753
BLAKE2b-256 055205757adfa27cbe1d9b44009be6cff58bf0877de1a0b700e7eb35594cd26d

File details

Details for the file wikipedia2vec-2.0.0-cp38-cp38-win_amd64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 1cfbfb90478ec515b8635bb286ea01b9c64a0c1a50ef7e251ed15aa2e7fa56ee
MD5 ad8ae2e364e6be692300524c44cac401
BLAKE2b-256 6f229258a3b5cca042e461c07c2d94fc7c5f97a54557b1a2c0d367651d9824f3

File details

Details for the file wikipedia2vec-2.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 747dcaeaba220f1b246881d2d6edbb4da5e344cba8733208ee37e72fe2dad2cb
MD5 40464083feae9e6d4d9da71fe3c429ad
BLAKE2b-256 f13c6972eb074133c6f2ef0810d7b56fafbd45c4ce0511b03048fb612c4d7e1f

File details

Details for the file wikipedia2vec-2.0.0-cp38-cp38-macosx_11_0_arm64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp38-cp38-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 23a3f8b080693683024ccbcf3534518fd8ba5e7d1a63241a9eecee2132afb8e3
MD5 c120886835c51d1a8ed711bcdede6f99
BLAKE2b-256 bf7e057d7e0ef6a34cc1cb1242a4d146071ffa5808c118b8636cce8538e08c12

File details

Details for the file wikipedia2vec-2.0.0-cp38-cp38-macosx_10_9_x86_64.whl.

File hashes

Hashes for wikipedia2vec-2.0.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 7bbbb8c69cd98a2fbc15d44bffc2c9b16228b060b4ed550fa960973e0b8c01fd
MD5 3cbe4cfbcffb83dcdf55cd8f34fb0e52
BLAKE2b-256 2661ed685e7c23ca454da813d380489e0bdb519606b0eb709f6c7116bec1616f
