
Implementation of the 'Gotta be SAFE: a new framework for molecular design' paper

🦺 SAFE

Sequential Attachment-based Fragment Embedding (SAFE) is a novel molecular line notation that represents molecules as an unordered sequence of fragment blocks to improve molecule design using generative models.



Paper | Docs | 🤗 Model | 🤗 Training Dataset




Overview of SAFE

SAFE is a molecular representation designed for deep learning. It is an encoding that leverages a peculiarity in the decoding schemes of SMILES to represent molecules as a contiguous sequence of connected fragments. SAFE strings are valid SMILES strings and therefore preserve the same amount of information. Representing molecules as an unordered sequence of connected fragments greatly simplifies the following tasks, which are often encountered in molecular design:

  • de novo design
  • superstructure generation
  • scaffold decoration
  • motif extension
  • linker generation
  • scaffold morphing.

Constructing a SAFE string requires a molecular fragmentation algorithm. By default, we use BRICS, but any other fragmentation algorithm can be used. The image below illustrates the process of building a SAFE string. The resulting string is a valid SMILES that can be read by datamol or RDKit.
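
To make the fragment linkage concrete, here is a minimal standard-library sketch that inspects a SAFE string without RDKit. It assumes attachment points are written as SMILES ring-closure labels (a single digit or a two-digit `%nn` form) and that a label appearing an odd number of times inside a fragment points to another fragment. The helper names `open_labels`, `attachment_map` and `dangling_labels` are hypothetical and are not part of the safe package.

```python
import re
from collections import Counter

# Assumption: attachment points are SMILES ring-closure labels ("1".."9" or "%nn").
BRACKET_ATOM = re.compile(r"\[[^\]]*\]")  # digits inside [..] (isotopes, H counts, charges) are not labels
RING_LABEL = re.compile(r"%\d{2}|\d")


def open_labels(fragment: str) -> set[str]:
    """Labels seen an odd number of times, i.e. not closed inside the fragment."""
    counts = Counter(RING_LABEL.findall(BRACKET_ATOM.sub("", fragment)))
    return {label for label, n in counts.items() if n % 2 == 1}


def attachment_map(safe_string: str) -> dict[str, list[int]]:
    """Map each open label to the indices of the fragments that carry it."""
    links: dict[str, list[int]] = {}
    for index, fragment in enumerate(safe_string.split(".")):
        for label in sorted(open_labels(fragment)):
            links.setdefault(label, []).append(index)
    return links


def dangling_labels(safe_string: str) -> set[str]:
    """Labels that still lack a partner fragment; empty for a complete molecule."""
    return {
        label
        for label, fragments in attachment_map(safe_string).items()
        if len(fragments) != 2
    }


# SAFE string for ibuprofen, taken from the example in this README.
ibuprofen_sf = "c12ccc3cc1.C3(C)C(=O)O.CC(C)C2"
links = attachment_map(ibuprofen_sf)

# Reordering the fragments keeps the same set of label pairings.
reordered = ".".join(reversed(ibuprofen_sf.split(".")))
same_pairings = set(links) == set(attachment_map(reordered))

# A lone scaffold fragment still has unpaired labels that other fragments can fill.
scaffold_open = dangling_labels("c12ccc3cc1")
print(links, same_pairings, sorted(scaffold_open))
```

Under these assumptions, the benzene fragment carries two open labels, each shared with exactly one other fragment. This is one way to picture why fragment order is not meaningful and why a partial string with unpaired labels can be completed by appending fragments.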


Installation

You can install safe using pip:

pip install safe-mol

Alternatively, you can use conda/mamba:

mamba install -c conda-forge safe-mol

Datasets and Models

| Type | Name | Infos | Size | Comment |
| --- | --- | --- | --- | --- |
| Model | datamol-io/safe-gpt | 87M params | 350M | Default model |
| Training Dataset | datamol-io/safe-gpt | 1.1B rows | 250GB | Training dataset |
| Drug Benchmark Dataset | datamol-io/safe-drugs | 26 rows | 20 kB | Benchmarking dataset |

Usage

Please refer to the documentation, which contains tutorials for getting started with safe and detailed descriptions of the functions provided, as well as an example of how to get started with SAFE-GPT.

API

We summarize some key functions provided by the safe package below.

| Function | Description |
| --- | --- |
| safe.encode | Translates a SMILES string into its corresponding SAFE string. |
| safe.decode | Translates a SAFE string into its corresponding SMILES string. The SAFE decoder simply augments RDKit's Chem.MolFromSmiles with an optional correction argument to take care of missing hydrogen bonds. |
| safe.split | Tokenizes a SAFE string to build a generative model. |

Examples

Translation between SAFE and SMILES representations

import safe

ibuprofen = "CC(Cc1ccc(cc1)C(C(=O)O)C)C"

# SMILES -> SAFE -> SMILES translation
try:
    ibuprofen_sf = safe.encode(ibuprofen)  # c12ccc3cc1.C3(C)C(=O)O.CC(C)C2
    ibuprofen_smi = safe.decode(ibuprofen_sf, canonical=True)  # CC(C)Cc1ccc(C(C)C(=O)O)cc1
    # Tokenize inside the try block so this only runs when encoding succeeded
    ibuprofen_tokens = list(safe.split(ibuprofen_sf))
except safe.EncoderError as err:
    print(f"Encoding failed: {err}")
except safe.DecoderError as err:
    print(f"Decoding failed: {err}")
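
To give a rough idea of what tokenizing a SAFE string can look like, the sketch below splits a string with a regular expression of the kind commonly used for SMILES. This is an assumption for illustration only: the pattern and the `tokenize` helper are hypothetical, and the token boundaries produced by `safe.split` may differ.

```python
import re

# Assumption: a generic SMILES-style token pattern, not necessarily the one
# used internally by safe.split.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"            # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"                 # two-letter organic-subset atoms
    r"|%\d{2}"                # two-digit ring-closure labels
    r"|[A-Za-z]"              # single-letter atoms (aliphatic or aromatic)
    r"|\d"                    # single-digit ring-closure labels
    r"|[().=#\-+\\/:~@?*$])"  # branches, fragment separator, bonds, misc symbols
)


def tokenize(safe_string: str) -> list[str]:
    """Split a SAFE/SMILES string into tokens, failing on unknown characters."""
    tokens = TOKEN_PATTERN.findall(safe_string)
    if "".join(tokens) != safe_string:
        raise ValueError(f"unrecognized characters in {safe_string!r}")
    return tokens


# SAFE string for ibuprofen, taken from the example above.
tokens = tokenize("c12ccc3cc1.C3(C)C(=O)O.CC(C)C2")
print(tokens)
```

The round-trip check in `tokenize` (joining the tokens must reproduce the input) is a cheap guard against silently dropping characters the pattern does not cover.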

Training/Finetuning a (new) model

A command-line interface is available to train a new model; run safe-train --help for the available options. You can also provide an existing checkpoint to continue training or to finetune on your own dataset.

For example:

safe-train --config <path to config> \
    --model-path <path to model> \
    --tokenizer  <path to tokenizer> \
    --dataset <path to dataset> \
    --num_labels 9 \
    --torch_compile True \
    --optim "adamw_torch" \
    --learning_rate 1e-5 \
    --prop_loss_coeff 1e-3 \
    --gradient_accumulation_steps 1 \
    --output_dir "<path to outputdir>" \
    --max_steps 5

References

If you use this repository, please cite the following related paper:

@misc{noutahi2023gotta,
      title={Gotta be SAFE: A New Framework for Molecular Design},
      author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
      year={2023},
      eprint={2310.10773},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

Note that all data and model weights of SAFE are exclusively licensed for research purposes. The accompanying dataset is licensed under CC BY 4.0, which permits solely non-commercial usage. See DATA_LICENSE for details.

This code base is licensed under the Apache-2.0 license. See LICENSE for details.

Development lifecycle

Setup dev environment

mamba create -n safe -f env.yml
mamba activate safe

pip install --no-deps -e .

Tests

You can run tests locally with:

pytest
