glove-python fork for bicleaner-ai

Project description

bicleaner-ai-glove

NOTE: this is a fork of glove-python made for bicleaner-ai.

A toy python implementation of GloVe.

GloVe produces dense vector embeddings of words, where words that occur together are close in the resulting vector space.

While this produces embeddings which are similar to word2vec (which has a great python implementation in gensim), the method is different: GloVe produces embeddings by factorizing the logarithm of the corpus word co-occurrence matrix.

The code uses asynchronous stochastic gradient descent, and is implemented in Cython. Most likely, it contains a tremendous number of bugs.
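To make the factorization idea concrete, here is a minimal pure-Python sketch of the weighted least-squares objective from the GloVe paper, fitted with plain single-threaded SGD. It is not the Cython code in this package: the function names (glove_weight, train_glove), the defaults (x_max=100, alpha=0.75, dim=8), and the use of a plain dict for the co-occurrence counts are all assumptions chosen for illustration.

```python
import math
import random


def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weighting f(x) from the GloVe paper: damps rare pairs, caps frequent ones at 1."""
    return (count / x_max) ** alpha if count < x_max else 1.0


def train_glove(cooccurrences, vocab_size, dim=8, epochs=50, lr=0.05, seed=0):
    """Fit w_i . c_j + b_i + b_j ~= log(X_ij) with plain SGD (illustrative only).

    cooccurrences maps (i, j) -> positive co-occurrence count X_ij.
    Returns (word_vectors, list of per-epoch losses).
    """
    rng = random.Random(seed)

    def init_vector():
        return [(rng.random() - 0.5) / dim for _ in range(dim)]

    W = [init_vector() for _ in range(vocab_size)]  # word vectors
    C = [init_vector() for _ in range(vocab_size)]  # context vectors
    bw = [0.0] * vocab_size                         # word biases
    bc = [0.0] * vocab_size                         # context biases

    pairs = list(cooccurrences.items())
    losses = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for (i, j), x in pairs:
            weight = glove_weight(x)
            pred = sum(a * b for a, b in zip(W[i], C[j])) + bw[i] + bc[j]
            err = pred - math.log(x)
            total += 0.5 * weight * err * err
            grad = weight * err
            for k in range(dim):
                wi, cj = W[i][k], C[j][k]
                W[i][k] -= lr * grad * cj
                C[j][k] -= lr * grad * wi
            bw[i] -= lr * grad
            bc[j] -= lr * grad
        losses.append(total)
    return W, losses
```

The per-epoch loss should fall as the dot products approach the log counts. A parallel implementation would spread the inner loop over several threads.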

Installation

Install from PyPI using pip: pip install glove_python. This fork is published on PyPI as bicleaner-ai-glove.

Note for OSX users: due to its use of OpenMP, glove-python does not compile under Clang. To install it, you will need a reasonably recent version of gcc (from Homebrew for instance). This should be picked up by setup.py; if it is not, please open an issue.

Building with the default Python distribution included in OSX is also not supported; please try the version from Homebrew or Anaconda.

Usage

Producing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The Corpus class helps in constructing a corpus from an iterable of tokens; the Glove class trains the embeddings (with a sklearn-esque API).
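The first of the two steps can be illustrated without the library. The sketch below counts co-occurrences within a symmetric window and weights each pair by the inverse of the token distance, as described in the GloVe paper. The function name build_cooccurrence, the window default, and the choice to store each unordered pair once are assumptions for illustration; this is not the Corpus class itself.

```python
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Return (dictionary, counts) for an iterable of token lists.

    dictionary maps each token to an integer id in first-seen order.
    counts maps (i, j) with i < j to a distance-weighted co-occurrence count.
    """
    dictionary = {}
    counts = defaultdict(float)
    for tokens in sentences:
        # Assign ids on first sight; len(dictionary) is the next free id.
        ids = [dictionary.setdefault(tok, len(dictionary)) for tok in tokens]
        for pos, left in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                right = ids[pos + offset]
                if left == right:
                    continue  # skip a word co-occurring with itself
                pair = (min(left, right), max(left, right))
                counts[pair] += 1.0 / offset  # closer words count more
    return dictionary, dict(counts)
```

The resulting counts play the role of the co-occurrence matrix that the second step factorizes.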

There is also rudimentary support for paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space in such a way that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the transform_paragraph method on the trained model.

Examples

example.py has some example code for running simple training scripts: ipython -i -- examples/example.py -c my_corpus.txt -t 10 should process your corpus, run 10 training epochs of GloVe, and drop you into an ipython shell where glove.most_similar('physics') should produce a list of similar words.

If you want to process a Wikipedia corpus, you can pass a Wikipedia dump file to the example.py script using the -w flag. Running make all-wiki should download a small Wikipedia dump file, process it, and train the embeddings. Building the co-occurrence matrix will take some time; training the vectors can be sped up by increasing the training parallelism to match the number of physical CPU cores available.

Running this on my machine yields roughly the following results:

In [1]: glove.most_similar('physics')
Out[1]:
[('biology', 0.89425889335342257),
 ('chemistry', 0.88913708236100086),
 ('quantum', 0.88859617025616333),
 ('mechanics', 0.88821824562025431)]

In [4]: glove.most_similar('north')
Out[4]:
[('west', 0.99047203572917908),
 ('south', 0.98655786905501008),
 ('east', 0.97914140138065575),
 ('coast', 0.97680427897282185)]

In [6]: glove.most_similar('queen')
Out[6]:
[('anne', 0.88284931171714842),
 ('mary', 0.87615260138308615),
 ('elizabeth', 0.87362497374226267),
 ('prince', 0.87011034923161801)]

In [19]: glove.most_similar('car')
Out[19]:
[('race', 0.89549347066796814),
 ('driver', 0.89350343749207217),
 ('cars', 0.83601334715106568),
 ('racing', 0.83157724991920212)]
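The scores in the listings above all lie below 1 and behave like cosine similarities between word vectors. Assuming that is what most_similar reports, a similarity query can be sketched in pure Python as follows. The standalone function signature and the list-of-lists vector layout are illustrative assumptions, not the package's own method.

```python
import math


def most_similar(word, dictionary, vectors, number=4):
    """Rank every other word by cosine similarity to `word`.

    dictionary maps word -> row index; vectors is a list of equal-length lists.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    query = vectors[dictionary[word]]
    scored = [(other, cosine(query, vectors[idx]))
              for other, idx in dictionary.items() if other != word]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:number]
```

Because cosine similarity ignores vector length, frequent words with large vectors do not automatically dominate the ranking.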

Development

Pull requests are welcome.

When making changes to the .pyx extension files, you'll need to run python setup.py cythonize to produce the extension .c and .cpp files before running pip install -e . (the trailing dot is part of the command).

Download files

Source Distribution

bicleaner-ai-glove-0.2.0.tar.gz (333.0 kB)

Built Distributions

bicleaner_ai_glove-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.10, manylinux (glibc 2.17+), x86-64

bicleaner_ai_glove-0.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.9, manylinux (glibc 2.17+), x86-64

bicleaner_ai_glove-0.2.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB): CPython 3.8, manylinux (glibc 2.17+), x86-64

bicleaner_ai_glove-0.2.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB): CPython 3.7m, manylinux (glibc 2.17+), x86-64
