JSON schema and validation code (in Python 3) for HEPData submissions

Installation

If you can, install LibYAML (a C library for parsing and emitting YAML) on your machine. This enables the use of CSafeLoader (instead of the pure-Python SafeLoader) for faster loading of YAML files. The difference is negligible for small files, but markedly better on larger documents.
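Whether the faster loader is available can be checked at import time; pyyaml's usual fallback pattern is:

```python
import yaml

# Prefer the LibYAML-backed CSafeLoader when the bindings are available,
# otherwise fall back to the pure-Python SafeLoader
try:
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

data = yaml.load('independent_variables: []', Loader=Loader)
```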

Install from PyPI using pip:

$ pip install --user hepdata-validator
$ hepdata-validate --help

If you would like to use LibYAML, you may need an additional step if running on an M1 Mac, to ensure pyyaml is built with the LibYAML bindings. Run the following after installing LibYAML via Homebrew:

$ LDFLAGS="-L$(brew --prefix)/lib" CFLAGS="-I$(brew --prefix)/include" pip install --global-option="--with-libyaml" --force pyyaml
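To confirm whether your pyyaml installation was actually built with the LibYAML bindings, check the `yaml.__with_libyaml__` flag:

```python
import yaml

# True if pyyaml was compiled against LibYAML, False otherwise
print(yaml.__with_libyaml__)
```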

Developers

Developers should install from GitHub in a virtual environment:

$ git clone https://github.com/HEPData/hepdata-validator
$ cd hepdata-validator
$ python3 -m venv ~/venv/hepdata-validator
$ source ~/venv/hepdata-validator/bin/activate
(hepdata-validator) $ pip install --upgrade -e ".[tests]"

Tests should be run both with and without LibYAML, as error messages from the different YAML parsers vary:

(hepdata-validator) $ USE_LIBYAML=True pytest testsuite
(hepdata-validator) $ USE_LIBYAML=False pytest testsuite

Usage

The hepdata-validator package allows you to validate (via the command line or Python):

  • A full directory of submission and data files

  • An archive file (.zip, .tar, .tar.gz, .tgz) containing all of the files

  • A single .yaml or .yaml.gz file (but not submission.yaml or a YAML data file)

  • A submission.yaml file or individual YAML data file (via Python only, not via the command line)
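For the .yaml.gz case, a plain YAML file can be compressed with nothing more than the standard library (the file name below is just a placeholder):

```python
import gzip
import shutil

# write a tiny stand-in file (in practice this is your single-YAML submission)
with open('single_yaml_file.yaml', 'w') as f:
    f.write('name: example\n')

# compress it to the .yaml.gz form accepted by the validator
with open('single_yaml_file.yaml', 'rb') as src:
    with gzip.open('single_yaml_file.yaml.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
```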

The same package is used to validate uploads made to hepdata.net, so validating offline first is an efficient way to check that your submission is valid before uploading.

Command line

Installing the hepdata-validator package adds the command hepdata-validate to your path, which allows you to validate a HEPData submission offline.

Examples

To validate a submission comprising multiple files in the current directory:

$ hepdata-validate

To validate a submission comprising multiple files in another directory:

$ hepdata-validate -d ../TestHEPSubmission

To validate an archive file (.zip, .tar, .tar.gz, .tgz) in the current directory:

$ hepdata-validate -a TestHEPSubmission.zip

To validate a single YAML file in the current directory:

$ hepdata-validate -f single_yaml_file.yaml

Usage options

$ hepdata-validate --help
Usage: hepdata-validate [OPTIONS]

  Offline validation of submission.yaml and YAML data files. Can check either
  a directory, an archive file, or the single YAML file format.

Options:
  -d, --directory TEXT  Directory to check (defaults to current working
                        directory)
  -f, --file TEXT       Single .yaml or .yaml.gz file (but not submission.yaml
                        or a YAML data file) to check - see https://hepdata-
                        submission.readthedocs.io/en/latest/single_yaml.html.
                        (Overrides directory)
  -a, --archive TEXT    Archive file (.zip, .tar, .tar.gz, .tgz) to check.
                        (Overrides directory and file)
  --help                Show this message and exit.

Python

Validating a full submission

To validate a full submission, instantiate a FullSubmissionValidator object:

from hepdata_validator.full_submission_validator import FullSubmissionValidator, SchemaType
full_submission_validator = FullSubmissionValidator()

# validate a directory
is_dir_valid = full_submission_validator.validate(directory='TestHEPSubmission')

# or uncomment to validate an archive file
# is_archive_valid = full_submission_validator.validate(archive='TestHEPSubmission.zip')

# or uncomment to validate a single file
# is_file_valid = full_submission_validator.validate(file='single_yaml_file.yaml')

# if there are any error messages, they are retrievable through this call
full_submission_validator.get_messages()

# the error messages can be printed for each file
full_submission_validator.print_errors('submission.yaml')

# the list of valid files can be retrieved via the valid_files property, which is a
# dict mapping SchemaType (e.g. SUBMISSION, DATA, SINGLE_YAML, REMOTE) to lists of
# valid files
full_submission_validator.valid_files[SchemaType.SUBMISSION]
full_submission_validator.valid_files[SchemaType.DATA]
# full_submission_validator.valid_files[SchemaType.SINGLE_YAML]

# if a remote schema is used, valid_files is a list of tuples (schema, file)
# full_submission_validator.valid_files[SchemaType.REMOTE]

# the list of valid files can be printed
full_submission_validator.print_valid_files()

Validating individual files

To validate submission files, instantiate a SubmissionFileValidator object:

from hepdata_validator.submission_file_validator import SubmissionFileValidator

submission_file_validator = SubmissionFileValidator()
submission_file_path = 'submission.yaml'

# the validate method takes a string representing the file path
is_valid_submission_file = submission_file_validator.validate(file_path=submission_file_path)

# if there are any error messages, they are retrievable through this call
submission_file_validator.get_messages()

# the error messages can be printed
submission_file_validator.print_errors(submission_file_path)

To validate data files, instantiate a DataFileValidator object:

from hepdata_validator.data_file_validator import DataFileValidator

data_file_validator = DataFileValidator()

# the validate method takes a string representing the file path
data_file_validator.validate(file_path='data.yaml')

# if there are any error messages, they are retrievable through this call
data_file_validator.get_messages()

# the error messages can be printed
data_file_validator.print_errors('data.yaml')

Optionally, if you have already loaded the YAML object, you can pass it in as the data argument. You must still pass file_path, since it is used as the key for the error-message lookup map.

from hepdata_validator.data_file_validator import DataFileValidator
import yaml

with open('data.yaml', 'r') as f:
    file_contents = yaml.safe_load(f)
data_file_validator = DataFileValidator()

data_file_validator.validate(file_path='data.yaml', data=file_contents)

data_file_validator.get_messages('data.yaml')

data_file_validator.print_errors('data.yaml')

For the analogous case of the SubmissionFileValidator:

from hepdata_validator.submission_file_validator import SubmissionFileValidator
import yaml
submission_file_path = 'submission.yaml'

with open(submission_file_path, 'r') as f:
    # convert the generator returned by yaml.safe_load_all into a list
    docs = list(yaml.safe_load_all(f))

submission_file_validator = SubmissionFileValidator()
is_valid_submission_file = submission_file_validator.validate(file_path=submission_file_path, data=docs)
submission_file_validator.print_errors(submission_file_path)

Schema Versions

There are multiple versions of the native HEPData JSON schemas. In most cases you should use the latest version (the default). If you need a different version, pass the keyword argument schema_version when initialising the validator:

submission_file_validator = SubmissionFileValidator(schema_version='0.1.0')
data_file_validator = DataFileValidator(schema_version='0.1.0')

Remote Schemas

When using remotely defined schemas, the available versions depend on the organization providing those schemas, and it is that organization's responsibility to offer a way of keeping track of different schema versions.

The JsonSchemaResolver object resolves $ref in the JSON schema. The HTTPSchemaDownloader object retrieves schemas from a remote location, and optionally saves them in the local file system, following the structure: schemas_remote/<org>/<project>/<version>/<schema_name>. An example may be:

from hepdata_validator.data_file_validator import DataFileValidator
data_validator = DataFileValidator()

# Split remote schema path and schema name
schema_path = 'https://scikit-hep.org/pyhf/schemas/1.0.0/'
schema_name = 'workspace.json'

# Create JsonSchemaResolver object to resolve $ref in JSON schema
from hepdata_validator.schema_resolver import JsonSchemaResolver
pyhf_resolver = JsonSchemaResolver(schema_path)

# Create HTTPSchemaDownloader object to validate against remote schema
from hepdata_validator.schema_downloader import HTTPSchemaDownloader
pyhf_downloader = HTTPSchemaDownloader(pyhf_resolver, schema_path)

# Retrieve and save the remote schema in the local path
pyhf_type = pyhf_downloader.get_schema_type(schema_name)
pyhf_spec = pyhf_downloader.get_schema_spec(schema_name)
pyhf_downloader.save_locally(schema_name, pyhf_spec)

# Load the custom schema as a custom type
import os
pyhf_path = os.path.join(pyhf_downloader.schemas_path, schema_name)
data_validator.load_custom_schema(pyhf_type, pyhf_path)

# Validate a specific schema instance
data_validator.validate(file_path='pyhf_workspace.json', file_type=pyhf_type)

The native HEPData JSON schemas are provided as part of the hepdata-validator package, so it is not necessary to download them. However, for testing purposes, the same mechanism as above could in principle be used with:

schema_path = 'https://hepdata.net/submission/schemas/1.1.0/'
schema_name = 'data_schema.json'

and passing a HEPData YAML data file as the file_path argument of the validate method.
