
Evaluation

Project description

eval


Python Library for Evaluation

What is Evaluation?

Evaluation allows us to assess how a given model performs against a set of specific tasks. This is done by running a set of standardized benchmark tests against the model. Running evaluation produces numerical scores across these benchmarks, as well as logged excerpts/samples of the outputs the model produced during them. Using these artifacts as a reference, together with a manual smoke test, gives us the best picture of whether or not a model has learned and improved on something we are trying to teach it. There are two stages of model evaluation in the InstructLab process:

Inter-checkpoint Evaluation

This step occurs during multi-phase training. Each phase of training produces multiple “checkpoints” of the model, taken at various points during the phase. At the end of each phase, we evaluate all of the checkpoints to find the one that provides the best results. This is done as part of the InstructLab Training library.

Full-scale final Evaluation

Once training is complete and we have picked the best checkpoint from the output of the final phase, we can run the full-scale evaluation suite, which runs MT-Bench, MMLU, MT-Bench Branch, and MMLU Branch.

Methods of Evaluation

Below are more in-depth explanations of the suite of benchmarks we use to evaluate models.

Multi-turn benchmark (MT-Bench)

tl;dr Full model evaluation of performance on skills

MT-Bench is a benchmark that asks a model 80 multi-turn questions, i.e.

<Question 1> → <model’s answer 1> → <Follow-up question> → <model’s answer 2>

A “judge” model reviews each multi-turn question and the candidate model’s answers, and rates each answer with a score out of 10. The scores are then averaged, and the resulting final score is the “MT-Bench score” for that model. This benchmark assumes no factual knowledge on the model’s part. The questions are static but do not become obsolete over time.

You can read more about MT-Bench here
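
To make the scoring mechanics concrete, here is a minimal sketch of the aggregation step, assuming the judge's per-answer ratings have already been collected; the record layout is illustrative and is not the library's actual output schema.

# Illustrative only: aggregate per-turn judge ratings into a single MT-Bench score.
judgments = [
    {"question_id": 81, "turn": 1, "score": 8.0},
    {"question_id": 81, "turn": 2, "score": 7.0},
    {"question_id": 82, "turn": 1, "score": 9.0},
]

# The overall MT-Bench score is the mean of all judge ratings (each out of 10).
mt_bench_score = sum(j["score"] for j in judgments) / len(judgments)
print(f"MT-Bench score: {mt_bench_score:.2f}")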

MT-Bench Branch

MT-Bench Branch is an adaptation of MT-Bench that is designed to test custom skills that are added to the model with the InstructLab project. These new skills come in the form of question/answer pairs in a Git branch of the taxonomy.

MT-Bench Branch prompts the candidate model with the user-supplied seed questions; the answers it generates are then judged by the judge model, using the user-supplied seed answers as a reference.
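
Conceptually, each seed question is answered by the candidate model and then handed to the judge alongside the user-supplied seed answer. The sketch below illustrates that pairing step with made-up data; it is not the library's internal representation.

# Illustrative only: pair each seed question with the candidate's answer and the
# user-supplied reference answer before the judge model rates it out of 10.
seed_questions = {"q1": "Rewrite this sentence in the passive voice: The dog threw the ball."}
reference_answers = {"q1": "The ball was thrown by the dog."}
model_answers = {"q1": "The ball was thrown by the dog."}

for qid, question in seed_questions.items():
    judge_input = {
        "question": question,
        "reference": reference_answers[qid],  # user-supplied seed answer
        "answer": model_answers[qid],         # candidate model's generated answer
    }
    print(judge_input)  # in practice, a bundle like this is what the judge model scores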

Massive Multitask Language Understanding (MMLU)

tl;dr Full model evaluation of performance on knowledge

MMLU is a benchmark consisting of a series of fact-based multiple-choice questions, each with 4 answer options. It tests whether a model can correctly interpret a question and its answer options, formulate its own answer, and then select the correct option from those provided. The questions are organized into a set of 57 “tasks”, each covering a given domain. The domains cover a number of topics ranging from Chemistry and Biology to US History and Math.

The model’s selections are then compared against the set of known correct answers for each question to determine how many it got right. The final MMLU score is the average of the per-task scores. This benchmark does not involve any reference/critic model and is completely objective. It does assume factual knowledge on the model’s part. The questions are static, so MMLU cannot be used to gauge the model’s knowledge of more recent topics.

InstructLab uses an implementation found here for running MMLU.

You can read more about MMLU here
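
To make the scoring concrete, the sketch below computes a per-task accuracy and then averages across tasks; the data is invented for illustration, and the real harness handles all of this internally.

# Illustrative only: MMLU-style scoring is per-task accuracy, then the mean across tasks.
results = {
    "college_chemistry": [("A", "A"), ("C", "B"), ("D", "D")],  # (predicted, correct)
    "us_history": [("B", "B"), ("A", "A")],
}

task_scores = {
    task: sum(pred == gold for pred, gold in answers) / len(answers)
    for task, answers in results.items()
}
mmlu_score = sum(task_scores.values()) / len(task_scores)
print(task_scores, f"MMLU score: {mmlu_score:.3f}")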

MMLU Branch

MMLU Branch is an adaptation of MMLU that is designed to test custom knowledge that is being added to the model via a Git branch of the taxonomy.

A teacher model is used to generate new multiple-choice questions based on the knowledge document included in the taxonomy Git branch. A “task” is then constructed that references the newly generated questions and answer choices. These tasks are then used to score the model’s grasp of the new knowledge in the same way MMLU works. Generation of these tasks is done as part of the InstructLab SDG library.
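
For a rough picture of what a generated item might look like, the sketch below shows one hypothetical multiple-choice record; the exact schema produced by the SDG library may differ.

# Hypothetical generated MMLU Branch item; the real schema may differ.
generated_item = {
    "question": "In what year was the example company founded?",
    "choices": ["1998", "2004", "2010", "2015"],
    "answer": 1,  # index of the correct choice
}
# Items like this are grouped into a "task" and scored exactly like MMLU:
# the model picks a choice and accuracy is computed against "answer".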

Development

⚠️ Note: You must use Python version 3.10 or later.

Set up your dev environment

The following tools are required:

Optional: Use cloud-instance.sh to launch and set up an instance

scripts/infra/cloud-instance.sh ec2 launch -t g6.2xlarge
scripts/infra/cloud-instance.sh ec2 setup-rh-devenv
scripts/infra/cloud-instance.sh ec2 install-rh-nvidia-drivers
scripts/infra/cloud-instance.sh ec2 ssh sudo reboot
scripts/infra/cloud-instance.sh ec2 ssh

Regardless of how you set up your instance

git clone https://github.com/instructlab/taxonomy.git && pushd taxonomy && git branch rc && popd
git clone --bare https://github.com/instructlab/eval.git && git clone eval.git/ && cd eval && git remote add syncrepo ../eval.git
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
pip install vllm
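
As a quick sanity check that the editable install resolved, you can try importing the package inside the activated venv; note that the instructlab.eval import path is assumed from the package name rather than stated in this README.

# Run inside the activated venv. Assumes the package is importable as "instructlab.eval".
import instructlab.eval  # noqa: F401

print("instructlab-eval import OK")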

Testing

Before pushing changes to GitHub, you need to run the tests as shown below. They can be run individually as shown in each sub-section, or all at once with the one command:

tox

Unit tests

Unit tests are enforced by the CI system using pytest. When making changes, run these tests before pushing the changes to avoid CI issues.

Running unit tests can be done with:

tox -e py3-unit

By default, all tests found within the tests directory are run. However, specific unit tests can be run by passing filenames, classes, and/or methods to pytest using tox positional arguments. The following example invokes a single test method test_mt_bench that is declared in the tests/test_mt_bench.py file:

tox -e py3-unit -- tests/test_mt_bench.py::test_mt_bench
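
For orientation, a test selected this way is ordinary pytest code; the example below is a hypothetical stand-in, not an actual test from this repository.

# Hypothetical tests/test_example.py -- not an actual test in this repository.
import pytest

def clamp_score(raw: float) -> float:
    # Toy helper: keep a judge rating inside the 1-10 range.
    return min(max(raw, 1.0), 10.0)

@pytest.mark.parametrize("raw,expected", [(12.3, 10.0), (-5.0, 1.0), (7.5, 7.5)])
def test_clamp_score(raw, expected):
    assert clamp_score(raw) == expected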

Functional tests

Functional tests are enforced by the CI system. When making changes, run the tests before pushing the changes to avoid CI issues.

Running functional tests can be done with:

tox -e py3-functional

Coding style

This library follows the Python PEP 8 coding style. The coding style is enforced by the CI system, and your PR will fail until the style has been applied correctly.

We use pre-commit to enforce coding style using black and isort.

You can invoke formatting with:

tox -e ruff

In addition, we use pylint to perform static code analysis of the code.

You can invoke the linting with the following command:

tox -e lint

MT-Bench / MT-Bench Branch Example Usage

Launch vllm serving granite-7b-lab

python -m vllm.entrypoints.openai.api_server --model instructlab/granite-7b-lab --tensor-parallel-size 1

In another shell window

export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10 # Optional if you want to shorten run times
# Commands relative to eval directory
python3 scripts/test_gen_answers.py
python3 scripts/test_branch_gen_answers.py

Example output tree

eval_output/
├── mt_bench
│   └── model_answer
│       └── instructlab
│           └── granite-7b-lab.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl

python3 scripts/test_judge_answers.py
python3 scripts/test_branch_judge_answers.py

Example output tree

eval_output/
├── mt_bench
│   ├── model_answer
│   │   └── instructlab
│   │       └── granite-7b-lab.jsonl
│   └── model_judgment
│       └── instructlab
│           └── granite-7b-lab_single.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── model_judgment
    │   │   └── instructlab
    │   │       └── granite-7b-lab_single.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── model_judgment
        │   └── instructlab
        │       └── granite-7b-lab_single.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl
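
Each *_single.jsonl judgment file is JSON Lines; the sketch below averages the judge's ratings from one of them. It assumes each record carries a numeric score field, which is an assumption about the output schema rather than a documented guarantee.

# Sketch: average the judge's ratings from a judgment file (schema assumed; adjust as needed).
import json

path = "eval_output/mt_bench/model_judgment/instructlab/granite-7b-lab_single.jsonl"
scores = []
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        if isinstance(record.get("score"), (int, float)):
            scores.append(record["score"])

if scores:
    print(f"{len(scores)} judgments, average score {sum(scores) / len(scores):.2f}")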

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

instructlab_eval-0.4.0.tar.gz (108.6 kB)

Uploaded Source

Built Distribution

instructlab_eval-0.4.0-py3-none-any.whl (66.3 kB)

Uploaded Python 3

File details

Details for the file instructlab_eval-0.4.0.tar.gz.

File metadata

  • Download URL: instructlab_eval-0.4.0.tar.gz
  • Upload date:
  • Size: 108.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for instructlab_eval-0.4.0.tar.gz
  • SHA256: 6a067843cc12744d1da50d87dfc5ad9aae04a2c898fdd2e493f3c35256b2f11b
  • MD5: 398b2f99463717d0f659a38d5dae80f7
  • BLAKE2b-256: 8060fb2e401996ad2c859715c070daf19ae5a508eecaf4cc0493e86be0980d60

See more details on using hashes here.

Provenance

The following attestation bundles were made for instructlab_eval-0.4.0.tar.gz:

Publisher: GitHub
  • Repository: instructlab/eval
  • Workflow: pypi.yaml
Attestations:
  • Statement type: https://in-toto.io/Statement/v1
    • Predicate type: https://docs.pypi.org/attestations/publish/v1
    • Subject name: instructlab_eval-0.4.0.tar.gz
    • Subject digest: 6a067843cc12744d1da50d87dfc5ad9aae04a2c898fdd2e493f3c35256b2f11b
    • Transparency log index: 148519448
    • Transparency log integration time:

File details

Details for the file instructlab_eval-0.4.0-py3-none-any.whl.

File metadata

File hashes

Hashes for instructlab_eval-0.4.0-py3-none-any.whl
  • SHA256: 0b59e432d1598a189dc6d0325869a95224540f07961a921f4f90286c39665a8c
  • MD5: 90e0f716ab1cc625ef528681d05b2241
  • BLAKE2b-256: ad42f696f409b689264c26ec7e4f542cdb7564449abeb33e3e656a9736fb09b5

See more details on using hashes here.

Provenance

The following attestation bundles were made for instructlab_eval-0.4.0-py3-none-any.whl:

Publisher: GitHub
  • Repository: instructlab/eval
  • Workflow: pypi.yaml
Attestations:
  • Statement type: https://in-toto.io/Statement/v1
    • Predicate type: https://docs.pypi.org/attestations/publish/v1
    • Subject name: instructlab_eval-0.4.0-py3-none-any.whl
    • Subject digest: 0b59e432d1598a189dc6d0325869a95224540f07961a921f4f90286c39665a8c
    • Transparency log index: 148519449
    • Transparency log integration time:
