Massive Text Embedding Benchmark

These details have been verified by PyPI

Maintainers

KennethEnevoldsen Muennighoff nouamanetazi nreimers

These details have not been verified by PyPI

Project links

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Developers
- Information Technology
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python

Project description

Massive Text Embedding Benchmark

Installation | Usage | Leaderboard | Citing | Tasks

Installation

pip install mteb

Usage

Using a Python script (see scripts/run_mteb_english.py and mteb/mtebscripts for more examples):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"

model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")

Using CLI

mteb --available_tasks

mteb -m average_word_embeddings_komninos \
    -t Banking77Classification  \
    --output_folder results/average_word_embeddings_komninos \
    --verbosity 3

To use multiple GPUs in parallel, give the model a custom encode function that distributes the inputs across the GPUs, e.g. as done here or here.
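
As a minimal sketch of that idea, the wrapper below splits the input sentences into contiguous shards, sends each shard to its own worker, and merges the results in the original order. The class name MultiWorkerModel and the workers argument are hypothetical and not part of mteb; each worker is assumed to be any object with an encode(sentences, batch_size=..., **kwargs) method, for example one copy of the model per GPU. Threads are used here only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class MultiWorkerModel:
    """Hypothetical wrapper: shards sentences across several encoders.

    `workers` is assumed to be a list of objects exposing
    encode(sentences, batch_size=..., **kwargs), e.g. one model copy per GPU.
    """

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker is required")
        self.workers = workers

    def encode(self, sentences, batch_size=32, **kwargs):
        sentences = list(sentences)
        n = len(self.workers)
        # Ceiling division, so that at most `n` contiguous shards are created.
        shard_size = max(1, -(-len(sentences) // n))
        shards = [
            sentences[i:i + shard_size]
            for i in range(0, len(sentences), shard_size)
        ]

        def run(pair):
            worker, shard = pair
            return list(worker.encode(shard, batch_size=batch_size, **kwargs))

        with ThreadPoolExecutor(max_workers=n) as pool:
            # pool.map yields results in shard order, so output order
            # matches input order.
            results = list(pool.map(run, zip(self.workers, shards)))
        return [emb for shard_result in results for emb in shard_result]
```

An instance of such a wrapper could then be passed to evaluation.run in place of a single model, since it exposes the same encode method.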

Advanced usage

Dataset selection

Datasets can be selected by providing the list of datasets, but also

by their task (e.g. "Clustering" or "Classification")

evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks

by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)

evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets

by their languages

evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"

You can also specify which languages to load for multilingual/crosslingual tasks like below:

from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining

evaluation = MTEB(tasks=[
        AmazonReviewsClassification(langs=["en", "fr"]), # Only load "en" and "fr" subsets of Amazon Reviews
        BUCCBitextMining(langs=["de-en"]), # Only load "de-en" subset of BUCC
])

There are also presets available for certain task collections, e.g. to select the 56 English datasets that form the "Overall MTEB English leaderboard":

from mteb import MTEB_MAIN_EN
evaluation = MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])

Evaluation split

You can evaluate only on test splits of all tasks by doing the following:

evaluation.run(model, eval_splits=["test"])

Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.
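
If you want to mirror the leaderboard rule noted above when running tasks one at a time, a small lookup is enough. This is only a sketch: leaderboard_splits and LEADERBOARD_SPLIT_OVERRIDES are hypothetical names and not part of mteb.

```python
# Hypothetical helper, not part of mteb: picks the split the public
# leaderboard is described as using for a given task name.
LEADERBOARD_SPLIT_OVERRIDES = {"MSMARCO": "dev"}


def leaderboard_splits(task_name):
    """Return the eval_splits list for a task: "dev" for MSMARCO, else "test"."""
    return [LEADERBOARD_SPLIT_OVERRIDES.get(task_name, "test")]
```

The returned list can be passed as the eval_splits argument, e.g. evaluation.run(model, eval_splits=leaderboard_splits("MSMARCO")) for an evaluation that contains only that task.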

Using a custom model

Models should implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.). For inspiration, see the mteb/mtebscripts repo, which was used to run diverse models via SLURM scripts for the paper.

class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
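
To make the interface concrete, here is a toy model whose encode method actually returns embeddings. It is a sketch for illustration only: HashingBagOfWordsModel is a hypothetical class that hashes whitespace tokens into a fixed number of buckets, not a model anyone would benchmark. It returns plain Python lists; whether every task accepts plain lists has not been checked here, so converting the result to np.array may be the safer choice in practice.

```python
import hashlib
import math


class HashingBagOfWordsModel:
    """Toy model (hypothetical): L2-normalised hashed bag-of-words vectors."""

    def __init__(self, dim=64):
        self.dim = dim

    def _embed(self, sentence):
        vec = [0.0] * self.dim
        for token in sentence.lower().split():
            # md5 is used only as a stable hash, so buckets are
            # reproducible across runs.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % self.dim
            vec[index] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm > 0 else vec

    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        # Process the input in chunks of `batch_size`, as a real model would.
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            embeddings.extend(self._embed(s) for s in batch)
        return embeddings
```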

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for encode_queries and encode_corpus. If these methods exist, they will be automatically used for those tasks. You can refer to the DRESModel at mteb/mteb/abstasks/AbsTaskRetrieval.py for an example of these functions.

class MyModel():
    def encode_queries(self, queries, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            queries (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            corpus (`List[str]` or `List[Dict[str, str]]`): List of sentences to encode
                or list of dictionaries with keys "title" and "text"
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass
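
The two methods can be filled in by wrapping an existing encoder. The sketch below is an assumption-laden illustration: QueryCorpusModel, base_model, and the "query: " / "passage: " prefixes are hypothetical choices, and the way title and text are joined is just one option. It only relies on the corpus format described in the docstring above (strings, or dictionaries with keys "title" and "text").

```python
class QueryCorpusModel:
    """Hypothetical wrapper adding encode_queries / encode_corpus to a model.

    `base_model` is assumed to expose encode(sentences, batch_size=..., **kwargs).
    """

    def __init__(self, base_model, query_prefix="query: ", passage_prefix="passage: "):
        self.base_model = base_model
        self.query_prefix = query_prefix
        self.passage_prefix = passage_prefix

    def encode(self, sentences, batch_size=32, **kwargs):
        # Used for tasks other than Retrieval and Reranking.
        return self.base_model.encode(sentences, batch_size=batch_size, **kwargs)

    def encode_queries(self, queries, batch_size=32, **kwargs):
        texts = [self.query_prefix + q for q in queries]
        return self.base_model.encode(texts, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        texts = []
        for doc in corpus:
            if isinstance(doc, dict):
                # Join title and text; either key may be missing or empty.
                title = doc.get("title", "")
                text = doc.get("text", "")
                doc = f"{title} {text}".strip()
            texts.append(self.passage_prefix + doc)
        return self.base_model.encode(texts, batch_size=batch_size, **kwargs)
```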

Evaluating on a custom task

To add a new task, you need to implement a new class that inherits from the AbsTask associated with the task type (e.g. AbsTaskReranking for reranking tasks). You can find the supported task types here.

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)

Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class like in this example.

Leaderboard

The MTEB Leaderboard is available here. To submit:

Run on MTEB: You can reference scripts/run_mteb_english.py for all MTEB English datasets used in the main ranking, or scripts/run_mteb_chinese.py for the Chinese ones. Advanced scripts with different models are available in the mteb/mtebscripts repo.
Format the JSON files into metadata using the script at scripts/mteb_meta.py. For example, python scripts/mteb_meta.py path_to_results_folder creates a mteb_metadata.md file. If you ran CQADupstack retrieval, make sure to merge the results first with python scripts/merge_cqadupstack.py path_to_results_folder.
Copy the content of the mteb_metadata.md file to the top of a README.md file of your model on the Hub. See here for an example.
Hit the Refresh button at the bottom of the leaderboard and you should see your scores 🥇
To have the scores appear without refreshing, you can open an issue on the Community Tab of the leaderboard and someone will restart the space to cache your average scores. The cache is updated about once a week anyway.

Citing

MTEB was introduced in "MTEB: Massive Text Embedding Benchmark". Feel free to cite it:

@article{muennighoff2022mteb,
  doi = {10.48550/ARXIV.2210.07316},
  url = {https://arxiv.org/abs/2210.07316},
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},  
  year = {2022}
}

You may also want to read and cite the amazing work that has extended MTEB & integrated new datasets:

Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff. "C-Pack: Packaged Resources To Advance General Chinese Embedding" arXiv 2023
Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, Han Xiao. "Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents" arXiv 2023
Silvan Wehrli, Bert Arnrich, Christopher Irrgang. "German Text Embedding Clustering Benchmark" arXiv 2024

For works that have used MTEB for benchmarking, you can find them on the leaderboard.

Available tasks

Name	Hub URL	Description	Type	Category	#Languages	Train #Samples	Dev #Samples	Test #Samples	Avg. chars / train	Avg. chars / dev	Avg. chars / test
BUCC	mteb/bucc-bitext-mining	BUCC bitext mining dataset	BitextMining	s2s	4	0	0	641684	0	0	101.3
Tatoeba	mteb/tatoeba-bitext-mining	1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus	BitextMining	s2s	112	0	0	2000	0	0	39.4
Bornholm parallel	strombergnlp/bornholmsk_parallel	Danish Bornholmsk Parallel Corpus.	BitextMining	s2s	2	100	100	100	64.6	86.2	89.7
DiaBLaBitextMining	rbawden/DiaBLa	English-French Parallel Corpus. DiaBLa is an English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue.	BitextMining	s2s	1	5748	0	0	0	0	0
FloresBitextMining	facebook/flores	FLORES is a benchmark dataset for machine translation between English and low-resource languages.	BitextMining	s2s	200	0	997	1012	0	0	0
AmazonCounterfactualClassification	mteb/amazon_counterfactual	A collection of Amazon customer reviews annotated for counterfactual detection pair classification.	Classification	s2s	4	4018	335	670	107.3	109.2	106.1
AmazonPolarityClassification	mteb/amazon_polarity	Amazon Polarity Classification Dataset.	Classification	s2s	1	3600000	0	400000	431.6	0	431.4
AmazonReviewsClassification	mteb/amazon_reviews_multi	A collection of Amazon reviews specifically designed to aid research in multilingual text classification.	Classification	s2s	6	1200000	30000	30000	160.5	159.2	160.4
MasakhaNEWSClassification	masakhane/masakhanews	MasakhaNEWS is the largest publicly available dataset for news topic classification in 16 languages widely spoken in Africa. The train/validation/test sets are available for all the 16 languages.	Classification	s2s	16	1476	211	422	5064.8	4756.1	5116.6
Banking77Classification	mteb/banking77	Dataset composed of online banking queries annotated with their corresponding intents.	Classification	s2s	1	10003	0	3080	59.5	0	54.2
EmotionClassification	mteb/emotion	Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.	Classification	s2s	1	16000	2000	2000	96.8	95.3	96.6
ImdbClassification	mteb/imdb	Large Movie Review Dataset	Classification	p2p	1	25000	0	25000	1325.1	0	1293.8
MassiveIntentClassification	mteb/amazon_massive_intent	MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages	Classification	s2s	51	11514	2033	2974	35.0	34.8	34.6
MassiveScenarioClassification	mteb/amazon_massive_scenario	MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages	Classification	s2s	51	11514	2033	2974	35.0	34.8	34.6
MTOPDomainClassification	mteb/mtop_domain	MTOP: Multilingual Task-Oriented Semantic Parsing	Classification	s2s	6	15667	2235	4386	36.6	36.5	36.8
MTOPIntentClassification	mteb/mtop_intent	MTOP: Multilingual Task-Oriented Semantic Parsing	Classification	s2s	6	15667	2235	4386	36.6	36.5	36.8
ToxicConversationsClassification	mteb/toxic_conversations_50k	Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not.	Classification	s2s	1	50000	0	50000	298.8	0	296.6
TweetSentimentExtractionClassification	mteb/tweet_sentiment_extraction		Classification	s2s	1	27481	0	3534	68.3	0	67.8
AngryTweetsClassification	mteb/DDSC/angry-tweets	A sentiment dataset with 3 classes (positiv, negativ, neutral) for Danish tweets	Classification	s2s	1	2410	0	1050	153.0	0	156.1
DKHateClassification	DDSC/dkhate	Danish Tweets annotated for Hate Speech	Classification	s2s	1	2960	0	329	88.2	0	104.0
DalajClassification	AI-Sweden/SuperLim	A Swedish dataset for linguistic acceptability. Available as a part of Superlim	Classification	s2s	1	3840	445	444	243.7	242.5	243.8
DanishPoliticalCommentsClassification	danish_political_comments	A dataset of Danish political comments rated for sentiment	Classification	s2s	1	9010	0	0	69.9	0	0
LccClassification	DDSC/lcc	The leipzig corpora collection, annotated for sentiment	Classification	s2s	1	349	0	150	113.5	0	118.7
NoRecClassification	ScandEval/norec-mini	A Norwegian dataset for sentiment classification on review	Classification	s2s	1	1020	256	2050	86.9	89.6	82.0
NordicLangClassification	strombergnlp/nordic_langid	A dataset for Nordic language identification.	Classification	s2s	6	57000	0	3000	78.4	0	78.2
NorwegianParliamentClassification	NbAiLab/norwegian_parliament	Norwegian parliament speeches annotated for sentiment	Classification	s2s	1	3600	1200	1200	1773.6	1911.0	1884.0
ScalaDaClassification	ScandEval/scala-da	A modified version of DDT modified for linguistic acceptability classification	Classification	s2s	1	1024	256	2048	107.6	100.8	109.4
ScalaNbClassification	ScandEval/scala-nb	A Norwegian dataset for linguistic acceptability classification for Bokmål	Classification	s2s	1	1024	256	2048	95.5	94.8	98.4
ScalaNnClassification	ScandEval/scala-nn	A Norwegian dataset for linguistic acceptability classification for Nynorsk	Classification	s2s	1	1024	256	2048	105.3	103.5	104.8
ScalaSvClassification	ScandEval/scala-sv	A Swedish dataset for linguistic acceptability classification	Classification	s2s	1	1024	256	2048	102.6	113.0	98.3
SweRecClassificition	ScandEval/swerec-mini	A Swedish dataset for sentiment classification on reviews	Classification	s2s	1	1024	256	2048	317.7	293.4	318.8
CBD	PL-MTEB/cbd	Polish Tweets annotated for cyberbullying detection.	Classification	s2s	1	10041	0	1000	93.6	0	93.2
PolEmo2.0-IN	PL-MTEB/polemo2_in	A collection of Polish online reviews from four domains: medicine, hotels, products and school. The PolEmo2.0-IN task is to predict the sentiment of in-domain (medicine and hotels) reviews.	Classification	s2s	1	5783	723	722	780.6	769.4	756.2
PolEmo2.0-OUT	PL-MTEB/polemo2_out	A collection of Polish online reviews from four domains: medicine, hotels, products and school. The PolEmo2.0-OUT task is to predict the sentiment of out-of-domain (products and school) reviews using models trained on reviews from medicine and hotels domains.	Classification	s2s	1	5783	494	494	780.6	589.3	587.0
AllegroReviews	PL-MTEB/allegro-reviews	A Polish dataset for sentiment classification on reviews from e-commerce marketplace Allegro.	Classification	s2s	1	9577	1002	1006	477.9	480.9	477.2
PAC	laugustyniak/abusive-clauses-pl	Polish Abusive Clauses Dataset	Classification	s2s	1	4284	1519	3453	185.3	256.8	185.3
AlloProfClusteringP2P	lyon-nlp/alloprof	Clustering of document titles and descriptions from Allo Prof dataset. Clustering of 10 sets on the document topic.	Clustering	p2p	1	2798	0	0	0	0	0
AlloProfClusteringS2S	lyon-nlp/alloprof	Clustering of document titles from Allo Prof dataset. Clustering of 10 sets on the document topic.	Clustering	s2s	1	2798	0	0	0	0	0
ArxivClusteringP2P	mteb/arxiv-clustering-p2p	Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category	Clustering	p2p	1	0	0	732723	0	0	1009.9
ArxivClusteringS2S	mteb/arxiv-clustering-s2s	Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category	Clustering	s2s	1	0	0	732723	0	0	74.0
BiorxivClusteringP2P	mteb/biorxiv-clustering-p2p	Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category.	Clustering	p2p	1	0	0	75000	0	0	1666.2
BiorxivClusteringS2S	mteb/biorxiv-clustering-s2s	Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category.	Clustering	s2s	1	0	0	75000	0	0	101.6
BlurbsClusteringP2P	slvnwhrl/blurbs-clustering-p2p	Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre	Clustering	p2p	1	0	0	174637	0	0	664.09
BlurbsClusteringS2S	slvnwhrl/blurbs-clustering-s2s	Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre.	Clustering	s2s	1	0	0	174637	0	0	23.02
HALClusteringS2S	lyon-nlp/clustering-hal-s2s	Clustering of titles from HAL. Clustering of 10 sets on the main category.	Clustering	s2s	1	85375	0	0	0	0	0
MedrxivClusteringP2P	mteb/medrxiv-clustering-p2p	Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category.	Clustering	p2p	1	0	0	37500	0	0	1981.2
MedrxivClusteringS2S	mteb/medrxiv-clustering-s2s	Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category.	Clustering	s2s	1	0	0	37500	0	0	114.7
RedditClustering	mteb/reddit-clustering	Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.	Clustering	s2s	1	0	0	420464	0	0	64.7
RedditClusteringP2P	mteb/reddit-clustering-p2p	Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.	Clustering	p2p	1	0	0	459399	0	0	727.7
StackExchangeClustering	mteb/stackexchange-clustering	Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.	Clustering	s2s	1	0	417060	373850	0	56.8	57.0
StackExchangeClusteringP2P	mteb/stackexchange-clustering-p2p	Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs.	Clustering	p2p	1	0	0	75000	0	0	1090.7
TenKGnadClusteringP2P	slvnwhrl/tenkgnad-clustering-p2p	Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.	Clustering	p2p	1	0	0	45914	0	0	2641.03
TenKGnadClusteringS2S	slvnwhrl/tenkgnad-clustering-s2s	Clustering of news article titles. Clustering of 10 splits on the news article category.	Clustering	s2s	1	0	0	45914	0	0	50.96
TwentyNewsgroupsClustering	mteb/twentynewsgroups-clustering	Clustering of the 20 Newsgroups dataset (subject only).	Clustering	s2s	1	0	0	59545	0	0	32.0
8TagsClustering	PL-MTEB/8tags-clustering	Clustering of headlines from social media posts in Polish belonging to 8 categories: film, history, food, medicine, motorization, work, sport and technology.	Clustering	s2s	1	40001	5000	4372	78.2	77.6	79.2
OpusparcusPC	GEM/opusparcus	Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases consist of subtitles from movies and TV shows.	PairClassification	s2s	6	1007	0	0	0	0	0
SprintDuplicateQuestions	mteb/sprintduplicatequestions-pairclassification	Duplicate questions from the Sprint community.	PairClassification	s2s	1	0	101000	101000	0	65.2	67.9
TwitterSemEval2015	mteb/twittersemeval2015-pairclassification	Paraphrase-Pairs of Tweets from the SemEval 2015 workshop.	PairClassification	s2s	1	0	0	16777	0	0	38.3
TwitterURLCorpus	mteb/twitterurlcorpus-pairclassification	Paraphrase-Pairs of Tweets.	PairClassification	s2s	1	0	0	51534	0	0	79.5
PPC	PL-MTEB/ppc-pairclassification	Polish Paraphrase Corpus	PairClassification	s2s	1	5000	1000	1000	41.0	41.0	40.2
PSC	PL-MTEB/psc-pairclassification	Polish Summaries Corpus	PairClassification	s2s	1	4302	0	1078	537.1	0	549.3
SICK-E-PL	PL-MTEB/sicke-pl-pairclassification	Polish version of SICK dataset for textual entailment.	PairClassification	s2s	1	4439	495	4906	43.4	44.7	43.2
CDSC-E	PL-MTEB/cdsce-pairclassification	Compositional Distributional Semantics Corpus for textual entailment.	PairClassification	s2s	1	8000	1000	1000	71.9	73.5	75.2
AskUbuntuDupQuestions	mteb/askubuntudupquestions-reranking	AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar	Reranking	s2s	1	0	0	2255	0	0	52.5
MindSmallReranking	mteb/mind_small	Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research	Reranking	s2s	1	231530	0	107968	69.0	0	70.9
SciDocsRR	mteb/scidocs-reranking	Ranking of related scientific papers based on their title.	Reranking	s2s	1	0	19594	19599	0	69.4	69.0
StackOverflowDupQuestions	mteb/stackoverflowdupquestions-reranking	Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python	Reranking	s2s	1	23018	0	3467	49.6	0	49.8
AlloprofRetrieval	lyon-nlp/alloprof	This dataset was provided by AlloProf, an organisation in Quebec, Canada offering resources and a help forum curated by a large number of teachers to students on all subjects taught in primary and secondary school	Retrieval	s2p	1	2798	0	0	0	0	0
ArguAna	mteb/arguana	ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge	Retrieval	p2p	1	0	0	10080	0	0	1052.9
BSARDRetrieval	maastrichtlawtech/bsard	The Belgian Statutory Article Retrieval Dataset (BSARD) is a French native dataset for studying legal information retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.	Retrieval	s2p	1	222	0	0	0	0	0
ClimateFEVER	mteb/climate-fever	CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change.	Retrieval	s2p	1	0	0	5418128	0	0	539.1
CQADupstackAndroidRetrieval	mteb/cqadupstack-android	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	23697	0	0	578.7
CQADupstackEnglishRetrieval	mteb/cqadupstack-english	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	41791	0	0	467.1
CQADupstackGamingRetrieval	mteb/cqadupstack-gaming	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	46896	0	0	474.7
CQADupstackGisRetrieval	mteb/cqadupstack-gis	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	38522	0	0	991.1
CQADupstackMathematicaRetrieval	mteb/cqadupstack-mathematica	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	17509	0	0	1103.7
CQADupstackPhysicsRetrieval	mteb/cqadupstack-physics	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	39355	0	0	799.4
CQADupstackProgrammersRetrieval	mteb/cqadupstack-programmers	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	33052	0	0	1030.2
CQADupstackStatsRetrieval	mteb/cqadupstack-stats	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	42921	0	0	1041.0
CQADupstackTexRetrieval	mteb/cqadupstack-tex	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	71090	0	0	1246.9
CQADupstackUnixRetrieval	mteb/cqadupstack-unix	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	48454	0	0	984.7
CQADupstackWebmastersRetrieval	mteb/cqadupstack-webmasters	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	17911	0	0	689.8
CQADupstackWordpressRetrieval	mteb/cqadupstack-wordpress	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2p	1	0	0	49146	0	0	1111.9
DBPedia	mteb/dbpedia	DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base	Retrieval	s2p	1	0	4635989	4636322	0	310.2	310.1
FEVER	mteb/fever	FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.	Retrieval	s2p	1	0	0	5423234	0	0	538.6
FiQA2018	mteb/fiqa	Financial Opinion Mining and Question Answering	Retrieval	s2p	1	0	0	58286	0	0	760.4
HagridRetrieval	miracl/hagrid	HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) is a dataset for generative information-seeking scenarios. It consists of queries along with a set of manually labelled relevant passages	Retrieval	s2p	1	716	0	0	0	0	0
HotpotQA	mteb/hotpotqa	HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.	Retrieval	s2p	1	0	0	5240734	0	0	288.6
MSMARCO	mteb/msmarco	MS MARCO is a collection of datasets focused on deep learning in search. Note that the dev set is used for the leaderboard.	Retrieval	s2p	1	0	8848803	8841866	0	336.6	336.8
MSMARCOv2	mteb/msmarco-v2	MS MARCO is a collection of datasets focused on deep learning in search	Retrieval	s2p	1	138641342	138368101	0	341.4	342.0	0
NFCorpus	mteb/nfcorpus	NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval	Retrieval	s2p	1	0	0	3956	0	0	1462.7
NQ	mteb/nq	Natural Questions: A Benchmark for Question Answering Research	Retrieval	s2p	1	0	0	2684920	0	0	492.7
QuoraRetrieval	mteb/quora	QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions.	Retrieval	s2s	1	0	0	532931	0	0	62.9
SCIDOCS	mteb/scidocs	SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation.	Retrieval	s2p	1	0	0	26657	0	0	1161.9
SciFact	mteb/scifact	SciFact verifies scientific claims using evidence from the research literature containing scientific paper abstracts.	Retrieval	s2p	1	0	0	5483	0	0	1422.3
Touche2020	mteb/touche2020	Touché Task 1: Argument Retrieval for Controversial Questions	Retrieval	s2p	1	0	0	382594	0	0	1720.1
TRECCOVID	mteb/trec-covid	TRECCOVID is an ad-hoc search challenge based on the CORD-19 dataset containing scientific articles related to the COVID-19 pandemic	Retrieval	s2p	1	0	0	171382	0	0	1117.4
ArguAna-PL	BeIR-PL/arguana-pl	ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge	Retrieval	p2p	1	0	0	10080	0	0	1052.9
DBPedia-PL	BeIR-PL/dbpedia-pl	DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base	Retrieval	s2p	1	0	4635989	4636322	0	310.2	310.1
FiQA-PL	BeIR-PL/fiqa-pl	Financial Opinion Mining and Question Answering	Retrieval	s2p	1	0	0	58286	0	0	760.4
HotpotQA-PL	BeIR-PL/hotpotqa-pl	HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.	Retrieval	s2p	1	0	0	5240734	0	0	288.6
MSMARCO-PL	BeIR-PL/msmarco-pl	MS MARCO is a collection of datasets focused on deep learning in search. Note that the dev set is used for the leaderboard.	Retrieval	s2p	1	0	8848803	8841866	0	336.6	336.8
NFCorpus-PL	BeIR-PL/nfcorpus-pl	NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval	Retrieval	s2p	1	0	0	3956	0	0	1462.7
NQ-PL	BeIR-PL/nq-pl	Natural Questions: A Benchmark for Question Answering Research	Retrieval	s2p	1	0	0	2684920	0	0	492.7
Quora-PL	BeIR-PL/quora-pl	QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions.	Retrieval	s2s	1	0	0	532931	0	0	62.9
SCIDOCS-PL	BeIR-PL/scidocs-pl	SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation.	Retrieval	s2p	1	0	0	26657	0	0	1161.9
SciFact-PL	BeIR-PL/scifact-pl	SciFact verifies scientific claims using evidence from the research literature containing scientific paper abstracts.	Retrieval	s2p	1	0	0	5483	0	0	1422.3
SweFAQ	AI-Sweden/SuperLim	Frequently asked questions from Swedish authorities' websites	Retrieval	s2p	1	0	0	513	0	0	390.57
BIOSSES	mteb/biosses-sts	Biomedical Semantic Similarity Estimation.	STS	s2s	1	0	0	200	0	0	156.6
SICK-R	mteb/sickr-sts	Semantic Textual Similarity SICK-R dataset.	STS	s2s	1	0	0	19854	0	0	46.1
STS12	mteb/sts12-sts	SemEval STS 2012 dataset.	STS	s2s	1	4468	0	6216	100.7	0	64.7
STS13	mteb/sts13-sts	SemEval STS 2013 dataset.	STS	s2s	1	0	0	3000	0	0	54.0
STS14	mteb/sts14-sts	SemEval STS 2014 dataset. Currently only the English dataset	STS	s2s	1	0	0	7500	0	0	54.3
STS15	mteb/sts15-sts	SemEval STS 2015 dataset	STS	s2s	1	0	0	6000	0	0	57.7
STS16	mteb/sts16-sts	SemEval STS 2016 dataset	STS	s2s	1	0	0	2372	0	0	65.3
STS17	mteb/sts17-crosslingual-sts	STS 2017 dataset	STS	s2s	11	0	0	500	0	0	43.3
STS22	mteb/sts22-crosslingual-sts	SemEval 2022 Task 8: Multilingual News Article Similarity	STS	s2s	18	0	0	8060	0	0	1992.8
STSBenchmark	mteb/stsbenchmark-sts	Semantic Textual Similarity Benchmark (STSbenchmark) dataset.	STS	s2s	1	11498	3000	2758	57.6	64.0	53.6
SICK-R-PL	PL-MTEB/sickr-pl-sts	Polish version of SICK dataset for textual relatedness.	STS	s2s	1	8878	990	9812	42.9	44.0	42.8
CDSC-R	PL-MTEB/cdscr-sts	Compositional Distributional Semantics Corpus for textual relatedness.	STS	s2s	1	16000	2000	2000	72.1	73.2	75.0
SummEval	mteb/summeval	News Article Summary Semantic Similarity Estimation.	Summarization	s2s	1	0	0	2800	0	0	359.8

For Chinese tasks, you can refer to C_MTEB.
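
Because the task table above is tab-separated with a header row, it can be filtered with the standard library. The following is a small sketch, assuming the table has been saved as text with its header line intact; tasks_by_type is a hypothetical helper and not part of mteb.

```python
import csv
import io


def tasks_by_type(table_text, task_type):
    """Return the names of all tasks whose "Type" column equals `task_type`.

    `table_text` is assumed to be the tab-separated task table, including
    its header row with the "Name" and "Type" columns.
    """
    reader = csv.DictReader(
        io.StringIO(table_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # descriptions may contain quote characters
    )
    return [row["Name"] for row in reader if row["Type"] == task_type]
```

The resulting list of names has the same shape as the tasks argument used in the usage examples, e.g. MTEB(tasks=tasks_by_type(table_text, "Reranking")).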


Release history

1.20.0

Nov 21, 2024

1.19.10

Nov 19, 2024

1.19.9

Nov 17, 2024

1.19.8

Nov 15, 2024

1.19.7

Nov 14, 2024

1.19.6

Nov 14, 2024

1.19.5

Nov 14, 2024

1.19.4

Nov 11, 2024

1.19.3

Nov 11, 2024

1.19.2

Nov 7, 2024

1.19.1

Nov 7, 2024

1.19.0

Nov 6, 2024

1.18.9

Nov 6, 2024

1.18.8

Nov 4, 2024

1.18.7

Nov 4, 2024

1.18.6

Oct 31, 2024

1.18.5

Oct 31, 2024

1.18.4

Oct 30, 2024

1.18.3

Oct 30, 2024

1.18.2

Oct 30, 2024

1.18.1

Oct 30, 2024

1.18.0

Oct 28, 2024

1.17.0

Oct 26, 2024

1.16.5

Oct 25, 2024

1.16.4

Oct 25, 2024

1.16.3

Oct 24, 2024

1.16.2

Oct 24, 2024

1.16.1

Oct 22, 2024

1.16.0

Oct 21, 2024

1.15.8

Oct 20, 2024

1.15.7

Oct 16, 2024

1.15.6

Oct 14, 2024

1.15.5

Oct 13, 2024

1.15.4

Oct 7, 2024

1.15.3

Oct 6, 2024

1.15.2

Oct 3, 2024

1.15.1

Oct 3, 2024

1.15.0

Oct 3, 2024

1.14.26

Sep 29, 2024

1.14.25

Sep 29, 2024

1.14.24

Sep 28, 2024

1.14.23

Sep 28, 2024

1.14.22

Sep 27, 2024

1.14.21

Sep 20, 2024

1.14.20

Sep 17, 2024

1.14.19

Sep 14, 2024

1.14.18

Sep 10, 2024

1.14.17

Sep 9, 2024

1.14.16

Sep 9, 2024

1.14.15

Sep 1, 2024

1.14.14

Sep 1, 2024

1.14.13

Sep 1, 2024

1.14.12

Aug 25, 2024

1.14.11

Aug 25, 2024

1.14.10

Aug 22, 2024

1.14.9

Aug 21, 2024

1.14.8

Aug 21, 2024

1.14.7

Aug 21, 2024

1.14.6

Aug 21, 2024

1.14.5

Aug 19, 2024

1.14.4

Aug 19, 2024

1.14.3

Aug 19, 2024

1.14.2

Aug 15, 2024

1.14.1

Aug 13, 2024

1.14.0

Aug 12, 2024

1.13.2

Aug 11, 2024

1.13.1

Aug 10, 2024

1.13.0

Aug 9, 2024

1.12.94

Aug 8, 2024

1.12.93

Aug 4, 2024

1.12.92

Aug 2, 2024

1.12.91

Aug 1, 2024

1.12.90

Jul 30, 2024

1.12.89

Jul 25, 2024

1.12.88

Jul 25, 2024

1.12.87

Jul 25, 2024

1.12.86

Jul 25, 2024

1.12.85

Jul 22, 2024

1.12.84

Jul 18, 2024

1.12.83

Jul 18, 2024

1.12.82

Jul 18, 2024

1.12.81

Jul 16, 2024

1.12.80

Jul 15, 2024

1.12.79

Jul 12, 2024

1.12.78

Jul 12, 2024

1.12.77

Jul 12, 2024

1.12.76

Jul 12, 2024

1.12.75

Jul 9, 2024

1.12.74

Jul 9, 2024

1.12.73

Jul 9, 2024

1.12.72

Jul 9, 2024

1.12.71

Jul 8, 2024

1.12.70

Jul 8, 2024

1.12.69

Jul 8, 2024

1.12.68

Jul 5, 2024

1.12.67

Jul 4, 2024

1.12.66

Jul 3, 2024

1.12.65

Jul 3, 2024

1.12.64

Jul 3, 2024

1.12.63

Jul 2, 2024

1.12.62

Jul 2, 2024

1.12.61

Jul 2, 2024

1.12.60

Jul 2, 2024

1.12.59

Jun 30, 2024

1.12.58

Jun 28, 2024

1.12.57

Jun 27, 2024

1.12.56

Jun 27, 2024

1.12.55

Jun 26, 2024

1.12.54

Jun 25, 2024

1.12.53

Jun 25, 2024

1.12.52

Jun 25, 2024

1.12.51

Jun 25, 2024

1.12.50

Jun 25, 2024

1.12.49

Jun 24, 2024

1.12.48

Jun 21, 2024

1.12.47

Jun 20, 2024

1.12.46

Jun 20, 2024

1.12.45

Jun 20, 2024

1.12.44

Jun 19, 2024

1.12.43

Jun 18, 2024

1.12.42

Jun 18, 2024

1.12.41

Jun 18, 2024

1.12.40

Jun 18, 2024

1.12.39

Jun 18, 2024

1.12.38

Jun 17, 2024

1.12.37

Jun 17, 2024

1.12.36

Jun 17, 2024

1.12.35

Jun 17, 2024

1.12.34

Jun 16, 2024

1.12.33

Jun 15, 2024

1.12.32

Jun 15, 2024

1.12.31

Jun 15, 2024

1.12.30

Jun 15, 2024

1.12.29

Jun 15, 2024

1.12.28

Jun 15, 2024

1.12.27

Jun 13, 2024

1.12.26

Jun 13, 2024

1.12.25

Jun 11, 2024

1.12.24

Jun 9, 2024

1.12.23

Jun 8, 2024

1.12.22

Jun 6, 2024

1.12.21

Jun 5, 2024

1.12.20

Jun 5, 2024

1.12.19

Jun 5, 2024

1.12.18

Jun 5, 2024

1.12.17

Jun 4, 2024

1.12.16

Jun 4, 2024

1.12.15

Jun 4, 2024

1.12.14

Jun 3, 2024

1.12.13

Jun 3, 2024

1.12.12

Jun 3, 2024

1.12.11

Jun 2, 2024

1.12.10

Jun 2, 2024

1.12.9

Jun 2, 2024

1.12.8

Jun 2, 2024

1.12.7

Jun 2, 2024

1.12.6

Jun 1, 2024

1.12.5

May 29, 2024

1.12.4

May 28, 2024

1.12.3

May 27, 2024

1.12.2

May 27, 2024

1.12.1

May 27, 2024

1.12.0

May 27, 2024

1.11.19

May 25, 2024

1.11.18

May 25, 2024

1.11.17

May 24, 2024

1.11.16

May 24, 2024

1.11.15

May 24, 2024

1.11.14

May 24, 2024

1.11.13

May 23, 2024

1.11.12

May 22, 2024

1.11.11

May 22, 2024

1.11.10

May 22, 2024

1.11.9

May 22, 2024

1.11.8

May 22, 2024

1.11.7

May 22, 2024

1.11.6

May 21, 2024

1.11.5

May 21, 2024

1.11.4

May 21, 2024

1.11.3

May 21, 2024

1.11.2

May 21, 2024

1.11.1

May 21, 2024

1.11.0

May 20, 2024

1.10.18

May 20, 2024

1.10.17

May 20, 2024

1.10.16

May 20, 2024

1.10.15

May 19, 2024

1.10.14

May 19, 2024

1.10.13

May 18, 2024

1.10.12

May 18, 2024

1.10.11

May 18, 2024

1.10.10

May 17, 2024

1.10.9

May 17, 2024

1.10.8

May 17, 2024

1.10.7

May 17, 2024

1.10.6

May 17, 2024

1.10.5

May 16, 2024

1.10.4

May 16, 2024

1.10.3

May 16, 2024

1.10.2

May 16, 2024

1.10.1

May 15, 2024

1.10.0

May 14, 2024

1.9.3

May 14, 2024

1.9.2

May 14, 2024

1.9.1

May 14, 2024

1.9.0

May 13, 2024

1.8.11

May 12, 2024

1.8.10

May 12, 2024

1.8.9

May 11, 2024

1.8.8

May 11, 2024

1.8.7

May 9, 2024

1.8.6

May 8, 2024

1.8.5

May 8, 2024

1.8.4

May 8, 2024

1.8.3

May 7, 2024

1.8.2

May 6, 2024

1.8.1

May 6, 2024

1.8.0

May 5, 2024

1.7.64

May 5, 2024

1.7.63

May 5, 2024

1.7.62

May 5, 2024

1.7.61

May 5, 2024

1.7.60

May 4, 2024

1.7.59

May 4, 2024

1.7.58

May 2, 2024

1.7.57

May 2, 2024

1.7.56

May 2, 2024

1.7.55

May 2, 2024

1.7.54

May 2, 2024

1.7.53

May 2, 2024

1.7.52

May 1, 2024

1.7.51

May 1, 2024

1.7.50

Apr 30, 2024

1.7.49

Apr 30, 2024

1.7.48

Apr 30, 2024

1.7.47

Apr 30, 2024

1.7.46

Apr 29, 2024

1.7.45

Apr 29, 2024

1.7.44

Apr 29, 2024

1.7.43

Apr 29, 2024

1.7.42

Apr 29, 2024

1.7.41

Apr 28, 2024

1.7.40

Apr 28, 2024

1.7.39

Apr 28, 2024

1.7.38

Apr 27, 2024

1.7.37

Apr 27, 2024

1.7.36

Apr 26, 2024

1.7.35

Apr 26, 2024

1.7.34

Apr 26, 2024

1.7.33

Apr 26, 2024

1.7.32

Apr 25, 2024

1.7.31

Apr 25, 2024

1.7.30

Apr 25, 2024

1.7.29

Apr 25, 2024

1.7.28

Apr 25, 2024

1.7.27

Apr 24, 2024

1.7.26

Apr 24, 2024

1.7.25

Apr 24, 2024

1.7.24

Apr 24, 2024

1.7.23

Apr 24, 2024

1.7.22

Apr 24, 2024

1.7.21

Apr 24, 2024

1.7.20

Apr 24, 2024

1.7.19

Apr 24, 2024

1.7.18

Apr 24, 2024

1.7.17

Apr 23, 2024

1.7.16

Apr 23, 2024

1.7.15

Apr 23, 2024

1.7.14

Apr 23, 2024

1.7.13

Apr 23, 2024

1.7.12

Apr 23, 2024

1.7.11

Apr 23, 2024

1.7.10

Apr 23, 2024

1.7.9

Apr 23, 2024

1.7.8

Apr 23, 2024

1.7.7

Apr 23, 2024

1.7.6

Apr 22, 2024

1.7.5

Apr 22, 2024

1.7.4

Apr 21, 2024

1.7.3

Apr 21, 2024

1.7.2

Apr 21, 2024

1.7.1

Apr 21, 2024

1.7.0

Apr 20, 2024

1.6.38

Apr 20, 2024

1.6.37

Apr 20, 2024

1.6.36

Apr 19, 2024

1.6.35

Apr 19, 2024

1.6.34

Apr 19, 2024

1.6.33

Apr 19, 2024

1.6.32

Apr 19, 2024

1.6.31

Apr 19, 2024

1.6.30

Apr 19, 2024

1.6.29

Apr 19, 2024

1.6.28

Apr 19, 2024

1.6.27

Apr 19, 2024

1.6.26

Apr 19, 2024

1.6.25

Apr 18, 2024

1.6.24

Apr 18, 2024

1.6.23

Apr 18, 2024

1.6.22

Apr 18, 2024

1.6.21

Apr 18, 2024

1.6.20

Apr 18, 2024

1.6.19

Apr 18, 2024

1.6.18

Apr 18, 2024

1.6.17

Apr 18, 2024

1.6.16

Apr 17, 2024

1.6.15

Apr 17, 2024

1.6.14

Apr 17, 2024

1.6.13

Apr 17, 2024

1.6.12

Apr 17, 2024

1.6.11

Apr 16, 2024

1.6.10

Apr 15, 2024

1.6.9

Apr 15, 2024

1.6.8

Apr 15, 2024

1.6.7

Apr 15, 2024

1.6.6

Apr 15, 2024

1.6.5

Apr 15, 2024

1.6.4

Apr 15, 2024

1.6.3

Apr 14, 2024

1.6.2

Apr 12, 2024

1.6.1

Apr 11, 2024

1.6.0

Apr 10, 2024

1.5.6

Apr 10, 2024

1.5.5

Apr 9, 2024

1.5.4

Apr 8, 2024

1.5.3

Apr 8, 2024

1.5.2

Apr 4, 2024

1.5.1

Apr 3, 2024

1.5.0

Apr 2, 2024

1.4.1

Apr 1, 2024

1.4.0

Apr 1, 2024

1.3.4

Apr 1, 2024

1.3.3

Mar 31, 2024

1.3.2

Mar 29, 2024

1.3.1

Mar 26, 2024

This version

1.2.0

Mar 6, 2024

1.1.2

Feb 16, 2024

1.1.1

Sep 20, 2023

1.1.0

Jul 31, 2023

1.0.2

Mar 28, 2023

1.0.1

Nov 29, 2022

1.0.0

Oct 17, 2022

0.9.1

Oct 13, 2022

0.0.1

Jun 30, 2022

Download files

Source Distribution

mteb-1.2.0.tar.gz (112.2 kB)

Uploaded Mar 6, 2024 Source

File details

Details for the file mteb-1.2.0.tar.gz.

File metadata

Download URL: mteb-1.2.0.tar.gz
Upload date: Mar 6, 2024
Size: 112.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for mteb-1.2.0.tar.gz
Algorithm	Hash digest
SHA256	`11a706bfc2bfe0a84aeb73ccf0a506ac6cc007d44cf2606b2cef3ab068c1159b`
MD5	`d189353d905c9624524498bc5dee77e3`
BLAKE2b-256	`fe06a233621f185b7fe63aae708b0a005f035fb2b0e36be3531d4ddc9f0319cf`
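
The digests above can be used to check that a downloaded archive was not corrupted or altered in transit. The following is a minimal sketch, not part of mteb or PyPI tooling: it uses only the standard-library `hashlib` and `hmac` modules, and the helper names `sha256_of_file` and `matches_expected` as well as the local path `mteb-1.2.0.tar.gz` are assumptions made for illustration.

```python
import hashlib
import hmac
from pathlib import Path

# SHA256 digest listed above for mteb-1.2.0.tar.gz
EXPECTED_SHA256 = "11a706bfc2bfe0a84aeb73ccf0a506ac6cc007d44cf2606b2cef3ab068c1159b"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals `expected_hex` (case-insensitive)."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    archive = Path("mteb-1.2.0.tar.gz")
    if archive.exists():
        print("hash ok" if matches_expected(archive) else "hash MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform a comparable check at install time when a requirements file pins hashes and `--require-hashes` is passed; the script above is only a manual alternative for a file that has already been downloaded.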