This repository is designed to simplify the evaluation process of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLM models and benchmarks.

These details have not been verified by PyPI

Project links

Project description


[Arxiv link]

Getting Started • Usage • Benchmarks & Models • Credit & Citation

Vision-Language Model Evaluation Repository

This repository is designed to simplify the evaluation process of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLM models and benchmarks. We offer 60 VLMs, inclusive of recent large-scale models like EVACLIP, with scales reaching up to 4.3B parameters and 12.8B training samples. Additionally, we provide implementations for 40 evaluation benchmarks.

Coming Soon

L-VLM (e.g. PaliGemma, LlavaNext)

Getting Started

Install the package:

pip install unibench -U

[option 2] Install Dependencies

Install the necessary dependencies by:
- Option 1, creating a new conda env: conda env create -f environment.yml
- Option 2, updating your conda env with required libraries: conda env update --file environment.yml --prune
Activate the environment: conda activate unibench
Install Spacy english language model: python -m spacy download en_core_web_sm
Install the package: pip install git+https://github.com/facebookresearch/unibench

Usage

Print out Results from Evaluated Models

The following command will print the results of the evaluations on all benchmarks and models:

unibench show_results

Run Evaluation using Command Line

The following command will run the evaluation on all benchmarks and models:

unibench evaluate

Run Evaluation using Custom Script

The following command will run the evaluation on all benchmarks and models:

import unibench as vlm

evaluator = vlm.Evaluator()
evaluator.evaluate()

Arguments for Evaluation

evaluate function takes the following arguments:

Args:
    save_freq (int): The frequency at which to save results. Defaults to 1000.
    face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
    device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
    batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.

The Evaluator class takes the following arguments:

Args:
    seed (int): Random seed for reproducibility.
    num_workers (int): Number of workers for data loading.
    models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
    benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
    model_id (Union[int, None]): Specific model ID to evaluate.
    benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
    output_dir (str): Directory to save evaluation results.
    benchmarks_dir (str): Directory containing benchmark data.
    download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
    download_all_precomputed (bool): Whether to download all precomputed results.

Example

The following command will run the evaluation for openclip_vitB32 trained on metaclip400m and CLIP ResNet50 on vg_relation,clevr_distance,fer2013,pcam,imageneta benchmarks:

unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,fer2013,pcam,imageneta]

In addition to saving the results in ~/.cache/unibench, the output would be a summary of the evaluation results:

  model_name                      non-natural images   reasoning   relation   robustness  
 ──────────────────────────────────────────────────────────────────────────────────────── 
  clip_resnet50                   63.95                 14.89       54.13      23.27       
  openclip_vitB32_metaclip_400m   63.87                 19.46       51.54      28.71

Supported Models and benchmarks

Full list of models and benchmarks are available in the models_zoo and benchmarks_zoo. You are also able to run the following commands:

unibench list_models
# or
unibench list_benchmarks

Sample Models

	Dataset Size (Million)	Number of Parameters (Million)	Learning Objective	Architecture	Model Name
blip_vitB16_14m	14	86	BLIP	vit	BLIP ViT B 16
blip_vitL16_129m	129	307	BLIP	vit	BLIP ViT L 16
blip_vitB16_129m	129	86	BLIP	vit	BLIP ViT B 16
blip_vitB16_coco	129	86	BLIP	vit	BLIP ViT B 16
blip_vitB16_flickr	129	86	BLIP	vit	BLIP ViT B 16

Sample benchmarks

	benchmark	benchmark_type
clevr_distance	zero-shot	vtab
fgvc_aircraft	zero-shot	transfer
objectnet	zero-shot	robustness
winoground	relation	relation
imagenetc	zero-shot	corruption

benchmarks Overview

benchmark type	number of benchmarks
ImageNet	1
vtab	18
transfer	7
robustness	6
relation	6
corruption	1

How results are saved

For each model, the results are saved in the output directory defined in constants: ~./.cache/unibench/outputs.

Add new Benchmark

To add new benchmark, you can simply inherit from the torch.utils.data.Dataset class and implement the __getitem__, and __len__ methods. For example, here is how to add ImageNetA as a new benchmark:

from functools import partial
from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler
from torchvision.datasets import FashionMNIST

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

benchmark = partial(
    FashionMNIST, root="/fsx-robust/haideraltahan", train=False, download=True
)
handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)


eval = Evaluator()

eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.update_model_list(["blip_vitB16_129m"])
eval.evaluate()

Add new Model

The most important compontent of adding a new model is creating or using pre-existing AbstractModel and implementing compute_zeroshot_weights, get_image_embeddings, and get_text_embeddings, similar to how ClipModel works:

class ClipModel(AbstractModel):
    def __init__(
        self,
        model,
        model_name,
        **kwargs,
    ):
        super(ClipModel, self).__init__(model, model_name, **kwargs)

    def compute_zeroshot_weights(self):
        zeroshot_weights = []
        for class_name in self.classes:
            texts = [template.format(class_name) for template in self.templates]

            class_embedding = self.get_text_embeddings(texts)

            class_embedding = class_embedding.mean(dim=0)
            class_embedding /= class_embedding.norm(dim=-1, keepdim=True)

            zeroshot_weights.append(class_embedding)
        self.zeroshot_weights = torch.stack(zeroshot_weights).T

    @torch.no_grad()
    def get_image_embeddings(self, images):
        image_features = self.model.encode_image(images.to(self.device))
        image_features /= image_features.norm(dim=1, keepdim=True)
        return image_features.unsqueeze(1)

    @torch.no_grad()
    def get_text_embeddings(self, captions):
        if (
            "truncate" in inspect.getfullargspec(self.tokenizer.__call__)[0]
            or "truncate" in inspect.getfullargspec(self.tokenizer)[0]
        ):
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length, truncate=True
            ).to(self.device)
        else:
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length
            ).to(self.device)

        caption_embeddings = self.model.encode_text(caption_tokens)
        caption_embeddings /= caption_embeddings.norm(dim=-1, keepdim=True)

        return caption_embeddings

Using the following class, we can then add models to the list of models. Here we have an example of adding and evaluating ViTamin-L.

from functools import partial
from io import open_code
from unibench import Evaluator
from unibench.models_zoo.wrappers.clip import ClipModel
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViTamin-L", pretrained="datacomp1b"
)

tokenizer = open_clip.get_tokenizer("ViTamin-L")

model = partial(
    ClipModel,
    model=model,
    model_name="vitamin_l_comp1b",
    tokenizer=tokenizer,
    input_resolution=model.visual.image_size[0],
    logit_scale=model.logit_scale,
)


eval = Evaluator(benchmarks_dir="/fsx-checkpoints/haideraltahan/.cache/unibench/data")

eval.add_model(model=model)
eval.update_benchmark_list(["imagenet1k"])
eval.update_model_list(["vitamin_l_comp1b"])
eval.evaluate()

Contributing

Contributions (e.g. adding new benchmarks/models), issues, and feature requests are welcome! For any changes, please open an issue first to discuss what you would like to change or improve.

License

The majority of UniBench is licensed under CC-BY-NC, however portions of the project are available under separate license terms:

License	Libraries
MIT license	zipp, tabulate, rich, openai-clip, latextable, gdown
Apache 2.0 license	transformers, timm, opencv-python, open-clip-torch, ftfy, fire, debtcollector, datasets, oslo.concurrency
BSD license	torchvision, torch, seaborn, scipy, scikit-learn, fairscale, cycler, contourpy, click, GitPython

Citation

If you use this repository in your research, please cite it as follows:

@inproceedings{altahan2024unibenchvisualreasoningrequires,
      title={UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling}, 
      author={Haider Al-Tahan and Quentin Garrido and Randall Balestriero and Diane Bouchacourt and Caner Hazirbas and Mark Ibrahim},
      year={2024},
      eprint={2408.04810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04810}, 
}

Recognition

Library structure was inspired by Robert Geirhos's work https://github.com/bethgelab/model-vs-human

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.3.1

Sep 26, 2024

0.3.0

Sep 3, 2024

0.2.0

Aug 16, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unibench-0.3.1.tar.gz (93.9 kB view details)

Uploaded Sep 26, 2024 Source

Built Distribution

unibench-0.3.1-py3-none-any.whl (104.9 kB view details)

Uploaded Sep 26, 2024 Python 3

File details

Details for the file unibench-0.3.1.tar.gz.

File metadata

Download URL: unibench-0.3.1.tar.gz
Upload date: Sep 26, 2024
Size: 93.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for unibench-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`71434a4aa0bb3c777568853b57a681e45bebee8b029caa6c141842f803c1acf4`
MD5	`db7160738655ac5e59fecbb82ee340a3`
BLAKE2b-256	`7990cf11ecd5bb83254fc138cb264b17a68b9fdd4c0ac6a022ceff121abe7a00`

See more details on using hashes here.

File details

Details for the file unibench-0.3.1-py3-none-any.whl.

File metadata

Download URL: unibench-0.3.1-py3-none-any.whl
Upload date: Sep 26, 2024
Size: 104.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for unibench-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f7a4fa7bc73116c661e19e2e4965aa2f0bb64d8f0a286c4c16b1bff363fd8e04`
MD5	`f4760280f9d4809584c424f15689b910`
BLAKE2b-256	`71996aa75a6b18326b73b39d42f48d804e8de1343c7bf55a0db5c667fa7b20c1`