Skip to main content

This repository is designed to simplify the evaluation process of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLM models and benchmarks.

Project description

header
[Arxiv link]

Getting StartedUsageBenchmarks & ModelsCredit & Citation

Vision-Language Model Evaluation Repository

This repository is designed to simplify the evaluation process of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLM models and benchmarks. We offer 60 VLMs, inclusive of recent large-scale models like EVACLIP, with scales reaching up to 4.3B parameters and 12.8B training samples. Additionally, we provide implementations for 40 evaluation benchmarks.

Coming Soon

  • L-VLM (e.g. PaliGemma, LlavaNext)

Getting Started

Install the package:

pip install unibench -U
[option 2] Install Dependencies
  1. Install the necessary dependencies by:
    • Option 1, creating a new conda env: conda env create -f environment.yml
    • Option 2, updating your conda env with required libraries: conda env update --file environment.yml --prune
  2. Activate the environment: conda activate unibench
  3. Install Spacy english language model: python -m spacy download en_core_web_sm
  4. Install the package: pip install git+https://github.com/fairinternal/unibench

Usage

Print out Results from Evaluated Models

The following command will print the results of the evaluations on all benchmarks and models:

unibench show_results

Run Evaluation using Command Line

The following command will run the evaluation on all benchmarks and models:

unibench evaluate

Run Evaluation using Custom Script

The following command will run the evaluation on all benchmarks and models:

import unibench as vlm

evaluator = vlm.Evaluator()
evaluator.evaluate()

Arguments for Evaluation

evaluate function takes the following arguments:

Args:
    save_freq (int): The frequency at which to save results. Defaults to 1000.
    face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
    device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
    batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.

The Evaluator class takes the following arguments:

Args:
    seed (int): Random seed for reproducibility.
    num_workers (int): Number of workers for data loading.
    models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
    benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
    model_id (Union[int, None]): Specific model ID to evaluate.
    benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
    output_dir (str): Directory to save evaluation results.
    benchmarks_dir (str): Directory containing benchmark data.
    download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
    download_all_precomputed (bool): Whether to download all precomputed results.

Example

The following command will run the evaluation for openclip_vitB32 trained on metaclip400m and CLIP ResNet50 on vg_relation,clevr_distance,fer2013,pcam,imageneta benchmarks:

unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,fer2013,pcam,imageneta]

In addition to saving the results in ~/.cache/unibench, the output would be a summary of the evaluation results:

  model_name                      non-natural images   reasoning   relation   robustness  
 ──────────────────────────────────────────────────────────────────────────────────────── 
  clip_resnet50                   63.95                 14.89       54.13      23.27       
  openclip_vitB32_metaclip_400m   63.87                 19.46       51.54      28.71   

Supported Models and benchmarks

Full list of models and benchmarks are available in the models_zoo and benchmarks_zoo. You are also able to run the following commands:

unibench list_models
# or
unibench list_benchmarks

Sample Models

Dataset Size (Million) Number of Parameters (Million) Learning Objective Architecture Model Name
blip_vitB16_14m 14 86 BLIP vit BLIP ViT B 16
blip_vitL16_129m 129 307 BLIP vit BLIP ViT L 16
blip_vitB16_129m 129 86 BLIP vit BLIP ViT B 16
blip_vitB16_coco 129 86 BLIP vit BLIP ViT B 16
blip_vitB16_flickr 129 86 BLIP vit BLIP ViT B 16

Sample benchmarks

benchmark benchmark_type
clevr_distance zero-shot vtab
fgvc_aircraft zero-shot transfer
objectnet zero-shot robustness
winoground relation relation
imagenetc zero-shot corruption

benchmarks Overview

benchmark type number of benchmarks
ImageNet 1
vtab 18
transfer 7
robustness 6
relation 6
corruption 1

How results are saved

For each model, the results are saved in the output directory defined in constants: ~./.cache/unibench/outputs.

Add new Benchmark

To add new benchmark, you can simply inherit from the torch.utils.data.Dataset class and implement the __getitem__, and __len__ methods. For example, here is how to add ImageNetA as a new benchmark:

from functools import partial
from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler
from torchvision.datasets import FashionMNIST

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

benchmark = partial(
    FashionMNIST, root="/fsx-robust/haideraltahan", train=False, download=True
)
handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)


eval = Evaluator()

eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.update_model_list(["blip_vitB16_129m"])
eval.evaluate()

Add new Model

The most important compontent of adding a new model is creating or using pre-existing AbstractModel and implementing compute_zeroshot_weights, get_image_embeddings, and get_text_embeddings, similar to how ClipModel works:

class ClipModel(AbstractModel):
    def __init__(
        self,
        model,
        model_name,
        **kwargs,
    ):
        super(ClipModel, self).__init__(model, model_name, **kwargs)

    def compute_zeroshot_weights(self):
        zeroshot_weights = []
        for class_name in self.classes:
            texts = [template.format(class_name) for template in self.templates]

            class_embedding = self.get_text_embeddings(texts)

            class_embedding = class_embedding.mean(dim=0)
            class_embedding /= class_embedding.norm(dim=-1, keepdim=True)

            zeroshot_weights.append(class_embedding)
        self.zeroshot_weights = torch.stack(zeroshot_weights).T

    @torch.no_grad()
    def get_image_embeddings(self, images):
        image_features = self.model.encode_image(images.to(self.device))
        image_features /= image_features.norm(dim=1, keepdim=True)
        return image_features.unsqueeze(1)

    @torch.no_grad()
    def get_text_embeddings(self, captions):
        if (
            "truncate" in inspect.getfullargspec(self.tokenizer.__call__)[0]
            or "truncate" in inspect.getfullargspec(self.tokenizer)[0]
        ):
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length, truncate=True
            ).to(self.device)
        else:
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length
            ).to(self.device)

        caption_embeddings = self.model.encode_text(caption_tokens)
        caption_embeddings /= caption_embeddings.norm(dim=-1, keepdim=True)

        return caption_embeddings

Using the following class, we can then add models to the list of models. Here we have an example of adding and evaluating ViTamin-L.

from functools import partial
from io import open_code
from unibench import Evaluator
from unibench.models_zoo.wrappers.clip import ClipModel
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViTamin-L", pretrained="datacomp1b"
)

tokenizer = open_clip.get_tokenizer("ViTamin-L")

model = partial(
    ClipModel,
    model=model,
    model_name="vitamin_l_comp1b",
    tokenizer=tokenizer,
    input_resolution=model.visual.image_size[0],
    logit_scale=model.logit_scale,
)


eval = Evaluator(benchmarks_dir="/fsx-checkpoints/haideraltahan/.cache/unibench/data")

eval.add_model(model=model)
eval.update_benchmark_list(["imagenet1k"])
eval.update_model_list(["vitamin_l_comp1b"])
eval.evaluate()

Contributing

Contributions (e.g. adding new benchmarks/models), issues, and feature requests are welcome! For any changes, please open an issue first to discuss what you would like to change or improve.

License

The majority of UniBench is licensed under CC-BY-NC, however portions of the project are available under separate license terms:

License Libraries
MIT license zipp, tabulate, rich, openai-clip, latextable, gdown
Apache 2.0 license transformers, timm, opencv-python, open-clip-torch, ftfy, fire, debtcollector, datasets, oslo.concurrency
BSD license torchvision, torch, seaborn, scipy, scikit-learn, fairscale, cycler, contourpy, click, GitPython

Citation

If you use this repository in your research, please cite it as follows:

@inproceedings{altahan2024unibenchvisualreasoningrequires,
      title={UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling}, 
      author={Haider Al-Tahan and Quentin Garrido and Randall Balestriero and Diane Bouchacourt and Caner Hazirbas and Mark Ibrahim},
      year={2024},
      eprint={2408.04810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04810}, 
}

Recognition

Library structure was inspired by Robert Geirhos's work https://github.com/bethgelab/model-vs-human

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

unibench-0.3.0.tar.gz (93.8 kB view details)

Uploaded Source

Built Distribution

unibench-0.3.0-py3-none-any.whl (104.9 kB view details)

Uploaded Python 3

File details

Details for the file unibench-0.3.0.tar.gz.

File metadata

  • Download URL: unibench-0.3.0.tar.gz
  • Upload date:
  • Size: 93.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for unibench-0.3.0.tar.gz
Algorithm Hash digest
SHA256 9879232004514590c47f40d797174df56114cbfc30fa0eb97dec5bcc0a1081cc
MD5 2c7b7f697472be728270e7bb4da76cca
BLAKE2b-256 5bfdba6ca3109a67c0077bb1f77288ca2802ddaaf83fdce7301e832e5dafef28

See more details on using hashes here.

Provenance

File details

Details for the file unibench-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: unibench-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 104.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for unibench-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7016fe2db101fd82d3552d9823aed6e31faf118f43a03d0233855d9e8f78cff9
MD5 c2f3b426360203a871143466cac17b85
BLAKE2b-256 f077c8eee478ce6537c6c566d4f090219078ccff5c3a2bd71b07050d3341d7ca

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page