Massive Text Embedding Benchmark
Massive Text Embedding Benchmark - Internal Development Git
Installation
pip install git+https://github.com/embeddings-benchmark/mteb.git
Minimal use
- Using a Python script:
from mteb import MTEB
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
- Using the CLI:
mteb --available_tasks
mteb -m average_word_embeddings_komninos \
    -t Banking77Classification NFCorpus \
    --output_folder results \
    --verbosity 3
Advanced usage
Tasks selection
Tasks can be selected by providing the list of tasks to run, but also
- by their types (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
- by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence tasks
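To make the two filters above concrete, here is a minimal standalone sketch of type and category selection over a few rows of the "Available tasks" table. It is not MTEB's implementation: the `TASKS` registry and the `select_tasks` helper are hypothetical, and it assumes that the filters combine with AND and that category names match case-insensitively ("S2S" in the call, "s2s" in the table), neither of which the text above states.

```python
# Illustrative only: a tiny stand-in registry built from a few rows of the
# "Available tasks" table. This is NOT how MTEB stores or filters tasks.
TASKS = [
    {"name": "Banking77Classification", "type": "Classification", "category": "s2s"},
    {"name": "ImdbClassification", "type": "Classification", "category": "p2p"},
    {"name": "ArxivClusteringP2P", "type": "Clustering", "category": "p2p"},
    {"name": "ArxivClusteringS2S", "type": "Clustering", "category": "s2s"},
    {"name": "NFCorpus", "type": "Retrieval", "category": "s2s"},
]


def select_tasks(task_types=None, task_categories=None):
    """Return the names of tasks that match every filter that was provided."""
    wanted_types = set(task_types) if task_types is not None else None
    # Assumption: categories are compared case-insensitively.
    wanted_categories = (
        {c.lower() for c in task_categories} if task_categories is not None else None
    )
    selected = []
    for task in TASKS:
        if wanted_types is not None and task["type"] not in wanted_types:
            continue
        if wanted_categories is not None and task["category"].lower() not in wanted_categories:
            continue
        selected.append(task["name"])
    return selected


print(select_tasks(task_types=["Clustering", "Retrieval"]))
# ['ArxivClusteringP2P', 'ArxivClusteringS2S', 'NFCorpus']
print(select_tasks(task_categories=["S2S"]))
# ['Banking77Classification', 'ArxivClusteringS2S', 'NFCorpus']
```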
You can also specify which languages to load for multilingual/crosslingual tasks like this:
from mteb.tasks.BitextMining import BUCCBitextMining
from mteb.tasks.Classification import AmazonReviewsClassification  # assumed to mirror the BitextMining import path
evaluation = MTEB(tasks=[
    BUCCBitextMining(langs=["de-en"]),  # Only load the "de-en" subset of BUCC
    AmazonReviewsClassification(langs=["en", "fr"]),  # Only load the "en" and "fr" subsets of Amazon Reviews
])
Using a custom model
A custom model must implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.).
class MyModel():
    def encode(self, sentences, batch_size=32):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass
model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
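As a concrete instance of the interface above, the following sketch fills in `encode` with a toy hashing bag-of-words embedder. Everything in it is an assumption made for illustration: the class name `HashingBagOfWordsModel`, the `dim` parameter, and the hashing scheme are hypothetical, and the embeddings are plain Python lists to keep the sketch dependency-free. Wrapping each vector with `np.asarray` would bring it in line with the embedding types named above.

```python
import hashlib
import math


class HashingBagOfWordsModel:
    """Toy model: hashes each token into one of `dim` buckets and counts hits."""

    def __init__(self, dim=64):
        self.dim = dim

    def _embed(self, sentence):
        vector = [0.0] * self.dim
        for token in sentence.lower().split():
            # A stable hash keeps embeddings identical across runs.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vector[bucket] += 1.0
        # L2-normalise; an empty sentence stays the zero vector.
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm > 0 else vector

    def encode(self, sentences, batch_size=32):
        """Return one embedding per sentence, processed in batches."""
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            embeddings.extend(self._embed(sentence) for sentence in batch)
        return embeddings


model = HashingBagOfWordsModel(dim=64)
embeddings = model.encode(["How do I reset my PIN?", "My card was declined."])
print(len(embeddings), len(embeddings[0]))  # 2 64
```

An instance of such a class would be passed to `evaluation.run(model)` in the same way as `MyModel` above. A toy embedder like this is only meant to show the shape of the interface, not to produce meaningful scores.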
Evaluating on a custom task
To add a new task, implement a new class that inherits from the AbsTask subclass associated with the task type (e.g. AbsTaskReranking for reranking tasks). The supported task types are the AbsTask classes in the mteb.abstasks package.
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer
class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class, as in this example.
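When writing the description property for a new task, a small check for missing fields can catch typos before an evaluation run. The sketch below is hypothetical: `REQUIRED_KEYS` is copied from the MindSmallReranking example above, on the assumption (not confirmed by any documented schema) that every task needs the same keys, and `missing_description_keys` is a helper name invented here, not part of mteb.

```python
# Keys copied from the MindSmallReranking example above; assumed, not
# documented, to be the full set a task description needs.
REQUIRED_KEYS = (
    "name",
    "hf_hub_name",
    "description",
    "reference",
    "type",
    "category",
    "eval_splits",
    "eval_langs",
    "main_score",
)


def missing_description_keys(description):
    """Return the assumed-required keys that are absent from `description`."""
    return [key for key in REQUIRED_KEYS if key not in description]


description = {
    "name": "MindSmallReranking",
    "hf_hub_name": "mteb/mind_small",
    "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
    "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
    "type": "Reranking",
    "category": "s2s",
    "eval_splits": ["validation"],
    "eval_langs": ["en"],
    "main_score": "map",
}
print(missing_description_keys(description))  # []
```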
Available tasks
Name | Hub URL | Description | Type | Category | N° Languages |
---|---|---|---|---|---|
BUCC | mteb/bucc-bitext-mining | BUCC bitext mining dataset | BitextMining | s2s | 4 |
Tatoeba | mteb/tatoeba-bitext-mining | 1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus | BitextMining | s2s | 112 |
AmazonCounterfactualClassification | mteb/amazon_counterfactual | A collection of Amazon customer reviews annotated for counterfactual detection pair classification. | Classification | s2s | 4 |
AmazonPolarityClassification | mteb/amazon_polarity | Amazon Polarity Classification Dataset. | Classification | s2s | 1 |
AmazonReviewsClassification | mteb/amazon_reviews_multi | A collection of Amazon reviews specifically designed to aid research in multilingual text classification. | Classification | s2s | 6 |
Banking77Classification | mteb/banking77 | Dataset composed of online banking queries annotated with their corresponding intents. | Classification | s2s | 1 |
EmotionClassification | mteb/emotion | Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. | Classification | s2s | 1 |
ImdbClassification | mteb/imdb | Large Movie Review Dataset | Classification | p2p | 1 |
MassiveIntentClassification | mteb/amazon_massive_intent | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51 |
MassiveScenarioClassification | mteb/amazon_massive_scenario | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51 |
MTOPDomainClassification | mteb/mtop_domain | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6 |
MTOPIntentClassification | mteb/mtop_intent | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6 |
ToxicConversationsClassification | mteb/toxic_conversations_50k | Collection of comments from the Civil Comments platform, annotated for whether each comment is toxic. | Classification | s2s | 1 |
TweetSentimentExtractionClassification | mteb/tweet_sentiment_extraction | | Classification | s2s | 1 |
ArxivClusteringP2P | mteb/arxiv-clustering-p2p | Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | p2p | 1 |
ArxivClusteringS2S | mteb/arxiv-clustering-s2s | Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | s2s | 1 |
BiorxivClusteringP2P | mteb/biorxiv-clustering-p2p | Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1 |
BiorxivClusteringS2S | mteb/biorxiv-clustering-s2s | Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1 |
MedrxivClusteringP2P | mteb/medrxiv-clustering-p2p | Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1 |
MedrxivClusteringS2S | mteb/medrxiv-clustering-s2s | Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1 |
RedditClustering | mteb/reddit-clustering | Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. | Clustering | s2s | 1 |
RedditClusteringP2P | mteb/reddit-clustering-p2p | Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. | Clustering | p2p | 1 |
StackExchangeClustering | mteb/stackexchange-clustering | Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. | Clustering | s2s | 1 |
StackExchangeClusteringP2P | mteb/stackexchange-clustering-p2p | Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. | Clustering | p2p | 1 |
TwentyNewsgroupsClustering | mteb/twentynewsgroups-clustering | Clustering of the 20 Newsgroups dataset (subject only). | Clustering | s2s | 1 |
SprintDuplicateQuestions | mteb/sprintduplicatequestions-pairclassification | Duplicate questions from the Sprint community. | PairClassification | s2s | 1 |
TwitterSemEval2015 | mteb/twittersemeval2015-pairclassification | Paraphrase-Pairs of Tweets from the SemEval 2015 workshop. | PairClassification | s2s | 1 |
TwitterURLCorpus | mteb/twitterurlcorpus-pairclassification | Paraphrase-Pairs of Tweets. | PairClassification | s2s | 1 |
AskUbuntuDupQuestions | mteb/askubuntudupquestions-reranking | AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar | Reranking | s2s | 1 |
MindSmallReranking | mteb/mind_small | Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research | Reranking | s2s | 1 |
SciDocs | mteb/scidocs-reranking | Ranking of related scientific papers based on their title. | Reranking | s2s | 1 |
StackOverflowDupQuestions | mteb/stackoverflowdupquestions-reranking | Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python | Reranking | s2s | 1 |
ArguAna | | ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge | Retrieval | s2s | 1 |
ClimateFEVER | | CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change. | Retrieval | s2s | 1 |
CQADupstackRetrieval | | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2s | 1 |
DBPedia | | DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base | Retrieval | s2s | 1 |
FEVER | | FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. | Retrieval | s2s | 1 |
FiQA2018 | | Financial Opinion Mining and Question Answering | Retrieval | s2s | 1 |
HotpotQA | | HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. | Retrieval | s2s | 1 |
MSMARCO | | MS MARCO is a collection of datasets focused on deep learning in search | Retrieval | s2s | 1 |
MSMARCOv2 | | | Retrieval | s2s | 1 |
NFCorpus | | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2s | 1 |
NQ | | Natural Questions: A Benchmark for Question Answering Research | Retrieval | s2s | 1 |
QuoraRetrieval | | QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions. | Retrieval | s2s | 1 |
SCIDOCS | | SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. | Retrieval | s2s | 1 |
SciFact | | | Retrieval | s2s | 1 |
Touche2020 | | Touché Task 1: Argument Retrieval for Controversial Questions | Retrieval | s2s | 1 |
TRECCOVID | | | Retrieval | s2s | 1 |
BIOSSES | mteb/biosses-sts | Biomedical Semantic Similarity Estimation. | STS | s2s | 1 |
SICK-R | mteb/sickr-sts | Semantic Textual Similarity SICK-R dataset. | STS | s2s | 1 |
STS12 | mteb/sts12-sts | SemEval STS 2012 dataset. | STS | s2s | 1 |
STS13 | mteb/sts13-sts | SemEval STS 2013 dataset. | STS | s2s | 1 |
STS14 | mteb/sts14-sts | SemEval STS 2014 dataset. Currently only the English dataset | STS | s2s | 1 |
STS15 | mteb/sts15-sts | SemEval STS 2015 dataset | STS | s2s | 1 |
STS16 | mteb/sts16-sts | SemEval STS 2016 dataset | STS | s2s | 1 |
STS17 | mteb/sts17-crosslingual-sts | STS 2017 dataset | STS | s2s | 11 |
STS22 | mteb/sts22-crosslingual-sts | SemEval 2022 Task 8: Multilingual News Article Similarity | STS | s2s | 18 |
STSBenchmark | mteb/stsbenchmark-sts | Semantic Textual Similarity Benchmark (STSbenchmark) dataset. | STS | s2s | 1 |
SummEval | mteb/summeval | News Article Summary Semantic Similarity Estimation. | Summarization | s2s | 1 |