Skip to main content

Sentence Embeddings using BERT / RoBERTa / XLM-R

Project description

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known as sentence embeddings). The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and are tuned specificially meaningul sentence embeddings such that sentences with similar meanings are close in vector space.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows an easy fine-tuning of custom embeddings models, to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net, as well as our publications:

Installation

We recommend Python 3.6 or higher, PyTorch 1.2.0 or higher and transformers v3.1.0 or higher. The code does not work with Python 2.7. Note, for some features and pre-trained models, you need PyTorch 1.6.0 or higher.

Install with pip

Install the sentence-transformers with pip:

pip install -U sentence-transformers

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA Version. Follow PyTorch - Get Started for further details how to install PyTorch.

Getting Started

See Quickstart in our documenation.

This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

Then provide some sentences to the model.

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

And that's it already. We now have a list of numpy arrays with the embeddings.

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').

» Full list of pretrained models

Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.

See Training Overview for an introduction how to train your own embedding models. We provide various examples how to train models on various datasets.

Some highlights are:

  • Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Elextra, BART, ...
  • Multi-Lingual and multi-task learning
  • Evaluation during training to find optimal model
  • 10+ loss-functions allowing to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, constrative loss.

Performance

Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed.

Model STS benchmark SentEval
Avg. GloVe embeddings 58.02 81.52
BERT-as-a-service avg. embeddings 46.35 84.04
BERT-as-a-service CLS-vector 16.50 84.66
InferSent - GloVe 68.03 85.59
Universal Sentence Encoder 74.92 85.10
Sentence Transformer Models
bert-base-nli-mean-tokens 77.12 86.37
bert-large-nli-mean-tokens 79.19 87.78
bert-base-nli-stsb-mean-tokens 85.14 86.07
bert-large-nli-stsb-mean-tokens 85.29 86.66
roberta-base-nli-stsb-mean-tokens 85.44 -
roberta-large-nli-stsb-mean-tokens 86.39 -
distilbert-base-nli-stsb-mean-tokens 85.16 -

Application Examples

We present some examples, how the generated sentence embeddings can be used for downstream applications.

Semantic Search

Semantic search is the task of finding similar sentences to a given sentence. See semantic_search.py for an example. See our documentation on more details about semantic search.

We first generate an embedding for all sentences in a corpus:

from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'A cheetah is running behind its prey.'
          ]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Query sentences:
queries = ['A man is eating pasta.', 'Someone in a gorilla costume is playing a set of drums.', 'A cheetah chases prey on across a field.']


# Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
top_k = 5
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    cos_scores = util.pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
    cos_scores = cos_scores.cpu()

    #We use torch.topk to find the highest 5 scores
    top_results = torch.topk(cos_scores, k=top_k)

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for score, idx in zip(top_results[0], top_results[1]):
        print(corpus[idx], "(Score: %.4f)" % (score))

The output looks like this:

Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.5777)
A man is eating a piece of bread. (Score: 0.4986)
A man is riding a horse. (Score: 0.1581)
A man is riding a white horse on an enclosed ground. (Score: 0.1474)
Two men pushed carts through the woods. (Score: 0.0992)

Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.6435)
A man is eating a piece of bread. (Score: 0.1719)
A man is eating food. (Score: 0.1240)
A man is riding a white horse on an enclosed ground. (Score: 0.0706)
A cheetah is running behind its prey. (Score: 0.0352)

Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.7769)
A man is riding a white horse on an enclosed ground. (Score: 0.2485)
A man is riding a horse. (Score: 0.2116)
A monkey is playing drums. (Score: 0.1820)
Two men pushed carts through the woods. (Score: 0.0869)

Clustering

clustering.py depicts an example to cluster similar sentences based on their sentence embedding similarity.

As before, we first compute an embedding for each sentence:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')


# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.']

corpus_embeddings = embedder.encode(corpus)


# Then, we perform k-means clustering using sklearn:
from sklearn.cluster import KMeans

num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

The output looks like this:

Cluster  1
['The girl is carrying a baby.', 'The baby is carried by the woman']

Cluster  2
['A cheetah is running behind its prey.', 'A cheetah chases prey on across a field.']

Cluster  3
['A monkey is playing drums.', 'Someone in a gorilla costume is playing a set of drums.']

Cluster  4
['A man is eating food.', 'A man is eating a piece of bread.', 'A man is eating pasta.']

Cluster  5
['A man is riding a horse.', 'A man is riding a white horse on an enclosed ground.']

For more examples, see examples/applications.

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:

@article{thakur-2020-AugSBERT,
    title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
    author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and  Gurevych, Iryna", 
    journal= "arXiv preprint arXiv:2010.08240",
    month = "10",
    year = "2020",
    url = "https://arxiv.org/abs/2010.08240",
}

The main contributors of this repository are:

Contact person: Nils Reimers, info@nils-reimers.de

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sentence-transformers-0.3.8.tar.gz (66.8 kB view details)

Uploaded Source

File details

Details for the file sentence-transformers-0.3.8.tar.gz.

File metadata

  • Download URL: sentence-transformers-0.3.8.tar.gz
  • Upload date:
  • Size: 66.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3.post20200330 requests-toolbelt/0.9.1 tqdm/4.50.1 CPython/3.6.10

File hashes

Hashes for sentence-transformers-0.3.8.tar.gz
Algorithm Hash digest
SHA256 315e7573870d6f7cda932434986823a3284aa80e37a124c30ff4d15540581a05
MD5 cb373bad7963a4fc5b0769897e550536
BLAKE2b-256 f4fd0190080aa0af78d7cd5874e4e8e85f0bed9967dd387cf05d760832b95da9

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page