Keeping the original LASER project alive
Project description
laser-keep-alive is a project aimed at providing a stable runtime environment for the open-source Facebook AI Research (FAIR) project, Language-Agnostic SEntence Representations (LASER).
Installation
Currently, installation is only possible from source.
git clone https://github.com/mingruimingrui/laser-keep-alive.git
cd laser-keep-alive
python setup.py install
To ensure hardware compatibility, an explicit installation of pytorch>=1.0 might be necessary.
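To see whether the PyTorch already present in your environment satisfies the pytorch>=1.0 requirement, a small check along these lines may help. This is only a sketch: the helper names `parse_major_minor` and `torch_meets_minimum` are made up for illustration and are not part of laser-keep-alive, and it assumes the package is registered under the distribution name `torch`.

```python
import re
from importlib import metadata


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def torch_meets_minimum(minimum=(1, 0)):
    """Return True or False if torch is installed, or None if it is missing."""
    try:
        # Assumption: PyTorch is installed under the distribution name 'torch'.
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
    return parse_major_minor(installed) >= minimum


print(torch_meets_minimum())
```

A result of `None` or `False` suggests installing PyTorch explicitly before installing this package.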
Basic Usage
Script Example
To use this package in your Python script, the easiest way is to import the laser.SentenceEncoder class.
from laser import SentenceEncoder

# Load the model
sent_encoder = SentenceEncoder(
    lang='en',
    model_path=path_to_model_file,
    bpe_codes=path_to_bpe_codes_file,
)

# Encode texts, given a List[str]
embeddings = sent_encoder.encode_sentences(list_of_texts)
# embeddings is a 2D np.ndarray of shape [num_texts, embedding_size]
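Once you have the embeddings array, a common next step is to compare sentences with each other. The sketch below computes pairwise cosine similarity with plain NumPy. It uses a random array as a stand-in for the output of encode_sentences so that it runs without a model file, and the function name `cosine_similarity_matrix` is hypothetical, not something this package provides.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for a [num_texts, embedding_size] array."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.maximum(norms, 1e-12)
    return unit @ unit.T


# Stand-in for sent_encoder.encode_sentences(list_of_texts) with 3 texts.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((3, 1024)).astype(np.float32)

similarities = cosine_similarity_matrix(embeddings)
print(similarities.shape)  # (3, 3)
```

Entry [i, j] of the result is the similarity between text i and text j, so the diagonal is 1 up to rounding.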
Commandline Tool
laser-keep-alive can also be run directly from the command line.
$ python -m laser
usage: python -m laser [-h] {encode,filter} ...
Language-Agnostic SEntence Representations
positional arguments:
{encode,filter}
encode Encode a text file line by line
filter Filter a parallel corpus based on similarity
optional arguments:
-h, --help show this help message and exit
At the moment, the following command-line routines are provided.
encode
Encodes a text file line by line into sentence embeddings.
Output formats are .npy and .csv.
If you are using the pretrained model, each embedding has 1024 dimensions. In the .npy output format, this corresponds to 4096 bytes per sentence for np.float32 and 2048 bytes per sentence for np.float16.
(Don't worry if that last sentence doesn't make sense to you.)
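The byte sizes follow from multiplying the embedding dimension by the size of one number in the chosen dtype. A minimal sketch of that arithmetic is below; `embedding_bytes` is a hypothetical helper for illustration, and it counts only the raw array data (a real .npy file also carries a small header).

```python
import numpy as np

EMBEDDING_SIZE = 1024  # dimension of the pretrained model's embeddings


def embedding_bytes(num_sentences, dtype, embedding_size=EMBEDDING_SIZE):
    """Raw data size in bytes for num_sentences embeddings of the given dtype."""
    return num_sentences * embedding_size * np.dtype(dtype).itemsize


print(embedding_bytes(1, np.float32))  # 4096
print(embedding_bytes(1, np.float16))  # 2048
```

For a file with one million lines, the same function gives roughly 4.1 GB for np.float32 and half of that for np.float16, which can help when choosing an output dtype.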
filter
Filters a parallel corpus line by line. Keeps only sentence pairs whose Euclidean distance is below a threshold (default: 1.04). To apply a stricter filter, use a smaller threshold.
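To give a feel for what distance-based filtering does, here is a rough sketch in NumPy. It is not the code behind the filter command: the function name `filter_parallel` and its arguments are invented for illustration, and any preprocessing the real command applies to the embeddings (for example normalisation) is not covered here.

```python
import numpy as np


def filter_parallel(src_lines, tgt_lines, src_emb, tgt_emb, threshold=1.04):
    """Keep (source, target) pairs whose embeddings are closer than threshold."""
    src_emb = np.asarray(src_emb, dtype=np.float32)
    tgt_emb = np.asarray(tgt_emb, dtype=np.float32)
    # Euclidean distance between the i-th source and i-th target embedding.
    distances = np.linalg.norm(src_emb - tgt_emb, axis=1)
    keep = distances < threshold
    return [(s, t) for s, t, k in zip(src_lines, tgt_lines, keep) if k]


# Toy 2-D vectors standing in for real sentence embeddings.
src_lines = ["hello", "good morning"]
tgt_lines = ["bonjour", "au revoir"]
src_emb = [[0.0, 0.0], [0.0, 0.0]]
tgt_emb = [[1.0, 0.0], [2.0, 0.0]]  # distances are 1.0 and 2.0

kept = filter_parallel(src_lines, tgt_lines, src_emb, tgt_emb)
print(kept)  # [('hello', 'bonjour')]
```

Lowering the threshold below 1.0 in this toy example would drop the first pair as well, which is the sense in which a smaller threshold is stricter.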
Downloading Pretrained Model
Pretrained models are necessary since this repository does not provide training code.
Please refer to this script to download pretrained models.
Credits
Full credit goes to Holger Schwenk, the author of the LASER toolkit, and to FAIR. For more information regarding FAIR and LASER, please visit their webpages.
- FAIR Website: https://ai.facebook.com/
- FAIR Github: https://github.com/facebookresearch
- LASER Github: https://github.com/facebookresearch/LASER/
If you like this project, please visit the LASER project page and give it a star ⭐.
License
laser-keep-alive is MIT-licensed and LASER is BSD-licensed.
If you wish to use laser-keep-alive, please remember to include the copyright notice.
Citation
Please cite Holger Schwenk and Matthijs Douze (also creator of FAISS).
@inproceedings{Schwenk2017LearningJM,
title={Learning Joint Multilingual Sentence Representations with Neural Machine Translation},
author={Holger Schwenk and Matthijs Douze},
booktitle={Rep4NLP@ACL},
year={2017},
}