Multilingual text embeddings

Project description

Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.

This framework provides an easy way to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks such as BERT, RoBERTa, and XLM-RoBERTa, and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts lie close together and can be found efficiently using cosine similarity.
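To make the cosine-similarity idea concrete, here is a minimal, dependency-free sketch. The three-dimensional vectors are made up for illustration (real models produce hundreds of dimensions), and `cosine_similarity` is a hypothetical helper written for this example, not a function of the library.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings", invented for this example only.
weather = [0.9, 0.1, 0.0]
sunny = [0.8, 0.2, 0.1]
stadium = [0.0, 0.2, 0.9]

# Vectors pointing in similar directions score higher.
print(cosine_similarity(weather, sunny) > cosine_similarity(weather, stadium))  # True
```

Because cosine similarity depends only on direction, not on vector length, it is a common choice for comparing embeddings.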

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows easy fine-tuning of custom embedding models to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net.

Installation

We recommend Python 3.8+, PyTorch 1.11.0+, and transformers v4.34.0+.

Install with pip

pip install -U sentence-transformers

Install with conda

conda install -c conda-forge sentence-transformers

Install from sources

Alternatively, you can clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA

If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
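A quick way to check whether the installed PyTorch build can see a CUDA device is sketched below. `pick_device` is a hypothetical helper name used only for this example; it falls back to "cpu" when PyTorch is missing or no GPU is visible.

```python
import importlib.util


def pick_device():
    """Return "cuda" if PyTorch is installed and sees a CUDA GPU, else "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can be passed to the model constructor, for example `SentenceTransformer("all-MiniLM-L6-v2", device=device)`.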

Getting Started

See Quickstart in our documentation.

First download a pretrained model.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

Then provide some sentences to the model.

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (3, 384)

And that's already it. We now have a NumPy array containing one embedding per text. We can use these to compute similarities.

similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])
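As a small follow-up, the similarity matrix can be used to find, for each sentence, its closest neighbour. The sketch below hard-codes the values printed above as nested lists so it runs without the library; with a real result you could presumably start from `similarities.tolist()`. `most_similar` is a hypothetical helper for this example.

```python
# Similarity values copied from the printed output above.
similarities = [
    [1.0000, 0.6660, 0.1046],
    [0.6660, 1.0000, 0.1411],
    [0.1046, 0.1411, 1.0000],
]
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]


def most_similar(index, matrix):
    """Return the index of the entry most similar to `index`, excluding itself."""
    candidates = [j for j in range(len(matrix)) if j != index]
    return max(candidates, key=lambda j: matrix[index][j])


for i, sentence in enumerate(sentences):
    j = most_similar(i, similarities)
    print(f"{sentence!r} -> {sentences[j]!r} ({similarities[i][j]:.4f})")
```

The two weather sentences pick each other, while the stadium sentence has only weak matches (its best score is 0.1411).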

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').
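For example, a multilingual model can be loaded by name in the same way and used to compare texts across languages. This is a sketch under assumptions: the model name `paraphrase-multilingual-MiniLM-L12-v2` is not mentioned on this page and is used here as one example of a multilingual checkpoint, and `cross_lingual_similarity` is a hypothetical wrapper. No expected scores are shown, since they depend on the model.

```python
def cross_lingual_similarity(texts, model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Embed texts (possibly in different languages) and return pairwise similarities."""
    # Imported lazily so the function can be defined even if the package
    # is not installed; calling it requires sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # downloaded on first use
    embeddings = model.encode(texts)
    return model.similarity(embeddings, embeddings)
```

A call such as `cross_lingual_similarity(["The weather is lovely today.", "Das Wetter ist heute schön."])` should return a 2x2 similarity matrix.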

Training

This framework allows you to fine-tune your own sentence embedding models, so that you get task-specific sentence embeddings. You have various options to choose from in order to get the best sentence embeddings for your specific task.

See Training Overview for an introduction on how to train your own embedding models. We provide various examples of how to train models on various datasets.

Some highlights are:

Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
Multilingual and multi-task learning
Evaluation during training to find the optimal model
20+ loss functions, including triplet loss and contrastive loss, for tuning models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, etc.
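To illustrate what a triplet loss optimizes, here is a textbook formulation in plain Python: the loss is zero once the anchor is closer to the positive than to the negative by at least a margin. This is a generic sketch, not the library's implementation; the Euclidean distance and the margin of 1.0 are arbitrary choices for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

print(triplet_loss(anchor, positive, negative))  # 0.0 (1 - 5 + 1 is negative)
print(triplet_loss(anchor, negative, positive))  # 5.0 (5 - 1 + 1)
```

During training, a non-zero loss pushes the positive closer to the anchor and the negative further away.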

Application Examples

You can use this framework for semantic search, paraphrase mining, semantic similarity comparison, clustering, and many more use-cases.

For all examples, see examples/applications.

Development setup

After cloning the repo (or a fork) to your machine, in a virtual environment, run:

python -m pip install -e ".[dev]"

pre-commit install

To test your changes, run:

pytest

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

Please have a look at Publications for our different publications that are integrated into SentenceTransformers.

Maintainer: Tom Aarsen, 🤗 Hugging Face

https://www.ukp.tu-darmstadt.de/

Don't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Release history

Version	Released	Notes
3.3.1	Nov 18, 2024
3.3.0	Nov 11, 2024
3.2.1	Oct 21, 2024
3.2.0	Oct 10, 2024
3.1.1	Sep 19, 2024
3.1.0	Sep 11, 2024
3.0.1	Jun 7, 2024
3.0.0	May 28, 2024	This version
2.7.0	Apr 17, 2024
2.6.1	Mar 26, 2024
2.6.0	Mar 22, 2024
2.5.1	Mar 1, 2024
2.5.0	Feb 29, 2024
2.4.0	Feb 23, 2024
2.3.1	Jan 30, 2024
2.3.0	Jan 29, 2024
2.2.2	Jun 26, 2022
2.2.1	Jun 23, 2022
2.2.0	Feb 10, 2022
2.1.0	Oct 1, 2021
2.0.0	Jun 24, 2021
1.2.1	Jun 24, 2021
1.2.0	May 24, 2021
1.1.1	May 12, 2021
1.1.0	Apr 21, 2021
1.0.4	Apr 1, 2021
1.0.3	Mar 22, 2021
1.0.2	Mar 19, 2021
1.0.1	Mar 18, 2021
1.0.0	Mar 18, 2021
0.4.1.2	Jan 4, 2021
0.4.1.1	Jan 4, 2021
0.4.1	Jan 4, 2021
0.4.0	Dec 22, 2020
0.3.9	Nov 18, 2020
0.3.8	Oct 19, 2020
0.3.7.2	Oct 2, 2020
0.3.7.1	Oct 1, 2020
0.3.7	Sep 29, 2020
0.3.6	Sep 11, 2020
0.3.5.1	Sep 2, 2020
0.3.5	Sep 1, 2020
0.3.4	Aug 24, 2020
0.3.3	Aug 6, 2020
0.3.2	Jul 23, 2020
0.3.1	Jul 22, 2020
0.3.0	Jul 9, 2020
0.2.6.2	Jun 30, 2020
0.2.6.1	Apr 16, 2020
0.2.6	Apr 16, 2020	Yanked. Reason: Bug in the setup.py
0.2.5.1	Mar 13, 2020
0.2.5	Jan 10, 2020
0.2.4.1	Dec 6, 2019
0.2.4	Dec 6, 2019
0.2.3	Aug 20, 2019
0.2.2	Aug 19, 2019
0.2.1	Aug 16, 2019
0.2.0	Aug 16, 2019
0.1.0	Jul 25, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sentence_transformers-3.0.0.tar.gz (174.7 kB)

Uploaded May 28, 2024 Source

Built Distribution

sentence_transformers-3.0.0-py3-none-any.whl (224.7 kB)

Uploaded May 28, 2024 Python 3

File details

Details for the file sentence_transformers-3.0.0.tar.gz.

File metadata

Download URL: sentence_transformers-3.0.0.tar.gz
Upload date: May 28, 2024
Size: 174.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for sentence_transformers-3.0.0.tar.gz
Algorithm	Hash digest
SHA256	`52d4101654ed107a28e9fa5110fce399084b55e7838fd8256471353ddc299033`
MD5	`03641192c8785b37b6459e9cbd364dfc`
BLAKE2b-256	`19c07051c672a48fe561decf7208cc18bbbdd4efa3323873aa1c86a3fb77fd97`

File details

Details for the file sentence_transformers-3.0.0-py3-none-any.whl.

File metadata

Download URL: sentence_transformers-3.0.0-py3-none-any.whl
Upload date: May 28, 2024
Size: 224.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.11.6

File hashes

Hashes for sentence_transformers-3.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9bf851b688b796e5fb06c920921efd5e5e05ee616e85cb3026fbdfe4dcf15bf3`
MD5	`f230f9d8e13ee3145d667cd6b6317ec5`
BLAKE2b-256	`f8c499a9386808025d5a546576243bfd3b1eb669f978b8a0e05a1253eaf89bf0`
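
The hashes listed above can be used to check a downloaded file for corruption or tampering. The following is a minimal sketch using only the standard library; `sha256_of_file` and `verify_download` are hypothetical helper names, and the file path in the usage note assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's SHA256 digest matches the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For instance, `verify_download("sentence_transformers-3.0.0.tar.gz", "52d4101654ed107a28e9fa5110fce399084b55e7838fd8256471353ddc299033")` should return True for an intact copy of the source distribution.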