GridTK: Slurm Job Management for Humans

Introduction

GridTK is a powerful command-line tool designed to simplify the management of Slurm jobs. At its core, GridTK provides a drop-in replacement for sbatch, gridtk submit, which allows you to get started quickly. This tutorial will guide you through the process of using the gridtk script to efficiently manage your Slurm workloads. We will cover the basics of installation, submission, monitoring, and various commands provided by GridTK.

Prerequisites

Before diving into GridTK, ensure you have the following prerequisites:

  1. A working Slurm setup.
  2. pipx installed.
  3. GridTK installed (instructions provided below).

Installation

To install GridTK, open your terminal and run the following command:

$ pipx install gridtk

Installing GridTK with pip install gridtk in the same environment as your experiments is not recommended. GridTK does not need to live in that environment, and its dependencies may conflict with your experiments' dependencies.
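
To confirm the prerequisites before submitting anything, a quick check along these lines can help. This is only a sketch: check_tool is a hypothetical helper name, not part of GridTK, and it only tests whether each command is on your PATH.

```shell
# check_tool is a hypothetical helper (not part of GridTK).
# It reports whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_tool sbatch   # provided by a working Slurm setup
check_tool pipx     # used to install GridTK
check_tool gridtk   # available after: pipx install gridtk
```

If gridtk is reported as missing right after installation, the directory where pipx places executables may not be on your PATH yet.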

Basic Usage

In this section, we will cover the basic commands and usage of the GridTK script. The primary goal is to help you get familiar with submitting, monitoring, and managing your Slurm jobs using GridTK.

Submitting a Job

To submit a job script, use gridtk submit. For example, given the script (job.sh) below:

#!/bin/bash

echo "Hello, GridTK!"

Submit the job using gridtk submit:

$ gridtk submit job.sh
1

where 1 is the local job id (not the Slurm job id) of your job. Local job ids always start at 1, which makes them easier to remember than Slurm job ids.

gridtk submit is a drop-in replacement for sbatch and accepts the same options while adding its own. Run gridtk submit --help to see the list of gridtk submit specific options and run sbatch --help to see the full list of options for sbatch.

Note that your Slurm cluster may require you to specify a partition, an account, or another option. You can pass them on the command line, as in gridtk submit --account=myaccount --partition=mypartition job.sh, or set default values through environment variables such as SBATCH_ACCOUNT and SBATCH_PARTITION.
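
For instance, the defaults could be set once in your shell startup file. The values myaccount and mypartition are the placeholders from the example above and must be replaced with names valid on your cluster.

```shell
# Defaults read by sbatch (and therefore by gridtk submit).
# myaccount and mypartition are placeholders, not real names.
export SBATCH_ACCOUNT=myaccount
export SBATCH_PARTITION=mypartition

# With these set, a plain submission needs no extra flags:
#   gridtk submit job.sh
```

Options given explicitly on the command line still take precedence over these environment variables.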

Monitoring Jobs

Use the gridtk list command to view the status of your jobs:

$ gridtk list
  job-id    slurm-id  nodes    state        job-name    output                  dependencies    command
--------  ----------  -------  -----------  ----------  ----------------------  --------------  --------------------
       1      136132  None     PENDING (0)  gridtk      logs/gridtk.136132.out                  gridtk submit job.sh

gridtk list will only show jobs that are submitted using gridtk submit in the current folder. You can see the submitted job got a local job id of 1 and a slurm job id of 136132. It is in the PENDING state and its name is gridtk by default (it is recommended to give a meaningful name using the gridtk submit --job-name option). The output files are written to the logs/ directory by default (you may change the directory with the gridtk --logs-dir option). GridTK manages the log files for you, so you don't have to worry about knowing where they are stored or cleaning them up.

For detailed information about a specific job, use the report command:

$ gridtk report -j 1
Job ID: 1
Name: gridtk
State: COMPLETED (0)
Nodes: None
Submitted command: ['sbatch', '--job-name', 'gridtk', '--output', 'logs/gridtk.%j.out', '--error', 'logs/gridtk.%j.out', 'job.sh']
Output file: logs/gridtk.136132.out
Hello, GridTK!

where you can see the exact sbatch command that was used to submit the job and the output of the job.

Stopping and Deleting a Job

To stop a running or pending job, use the gridtk stop command:

$ gridtk stop -j 1
Stopped job 1 with slurm id 136132

Stopped jobs are still available in the job list:

$ gridtk list
  job-id    slurm-id  nodes    state          job-name    output                  dependencies    command
--------  ----------  -------  -------------  ----------  ----------------------  --------------  --------------------
       1      136137  None     CANCELLED (0)  gridtk      logs/gridtk.136137.out                  gridtk submit job.sh

They can be resubmitted with the gridtk resubmit command (more details on resubmit further down), and their output can still be viewed with the gridtk report command.

To delete a job (and its log file), use the gridtk delete command:

$ gridtk delete -j 1
Deleted job 1 with slurm id 136137

Resubmitting a Job

If a job fails or is stopped, you can resubmit it using the gridtk resubmit command:

$ gridtk submit job.sh
1

$ gridtk stop -j 1
Stopped job 1 with slurm id 136139

$ gridtk resubmit -j 1
Resubmitted job 1

$ gridtk list
  job-id    slurm-id  nodes    state        job-name    output                  dependencies    command
--------  ----------  -------  -----------  ----------  ----------------------  --------------  --------------------
       1      136140  None     PENDING (0)  gridtk      logs/gridtk.136140.out                  gridtk submit job.sh

Notice how the resubmitted job got a new slurm job id of 136140.

Advanced Usage

GridTK provides several advanced commands to help with more complex job management tasks. These include submitting jobs without a script, job dependencies, and repeated jobs.

Job Submission without a Script

Since GridTK keeps track of both the sbatch options and the command to run, you can skip creating a script and submit a job directly from the command line. This is done by using --- (3 dashes) to separate the sbatch options from the command to run:

$ gridtk submit --job-name=gridtk-no-script --- echo 'Hello, GridTK!'
2

This syntax is unique to gridtk submit and is not supported by sbatch.

$ gridtk list
  job-id    slurm-id  nodes    state        job-name          output                            dependencies    command
--------  ----------  -------  -----------  ----------------  --------------------------------  --------------  ------------------------------------
       1      136140  None     PENDING (0)  gridtk            logs/gridtk.136140.out                            gridtk submit job.sh
       2      136142  None     PENDING (0)  gridtk-no-script  logs/gridtk-no-script.136142.out                  gridtk submit --- echo Hello, GridTK!

What happens is that gridtk submit creates a temporary script with the command to run and submits it to slurm. The temporary script is deleted after the job is submitted. The content of this temporary script can be viewed using the gridtk report command:

$ gridtk report -j 2
Job ID: 2
Name: gridtk-no-script
State: PENDING (0)
Nodes: None
Submitted command: ['sbatch', '--job-name', 'gridtk-no-script', '--output', 'logs/gridtk-no-script.%j.out', '--error', 'logs/gridtk-no-script.%j.out', '/tmp/tmpegoy2ma1.sh']
Content of the temporary script:
#!/bin/bash
echo 'Hello, GridTK!'

Output file: logs/gridtk-no-script.136142.out

This is a fast, convenient, and recommended way to submit a job without creating a script. Since everything is tracked by GridTK, you still benefit from the same reproducibility guarantees.
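
Script-less submission lends itself to small parameter sweeps. The sketch below is illustrative only: train.py, its --lr option, and the submit_sweep and DRY_RUN names are hypothetical and not part of GridTK. By default it only prints the commands it would run.

```shell
# DRY_RUN defaults to "echo", so commands are printed instead of executed.
# Run with DRY_RUN= (empty) to actually submit.
DRY_RUN=${DRY_RUN-echo}

# submit_sweep is a hypothetical helper: one job per learning rate.
# train.py and --lr stand in for your own program and its options.
submit_sweep() {
    for lr in "$@"; do
        $DRY_RUN gridtk submit --job-name="sweep-lr-$lr" --- python train.py --lr "$lr"
    done
}

submit_sweep 0.1 0.01
```

Giving each job its own --job-name keeps the entries easy to tell apart in gridtk list.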

Job Dependencies

To submit a job that depends on another job, use the --dependency flag:

$ gridtk submit --dependency=<job_id> job.sh

The --dependency flag takes the same values as in sbatch except that you need to specify local job ids instead of slurm job ids.
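
As an illustration, sbatch accepts values of the form type:id[:id...], such as afterok:1:2. The helper below is a hypothetical convenience (dependency_flag is not part of GridTK) that builds such a flag from local job ids.

```shell
# dependency_flag is a hypothetical helper (not part of GridTK).
# Usage: dependency_flag <type> <local-job-id> [<local-job-id> ...]
dependency_flag() {
    kind=$1
    shift
    ids=$(printf ':%s' "$@")   # 1 2 becomes :1:2
    printf -- '--dependency=%s%s\n' "$kind" "$ids"
}

dependency_flag afterok 1 2

# On a cluster, the result can be passed straight to gridtk submit:
#   gridtk submit "$(dependency_flag afterok 1 2)" job.sh
```

Here afterok means the new job starts only after local jobs 1 and 2 have completed successfully.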

Repeat Jobs

You can submit the same script N times using the --repeat flag:

$ gridtk submit --repeat=3 job.sh

This will submit 3 jobs with the same script and the same options, where each job depends on the previous one. This is useful if your script can resume from a checkpoint and you want to run it for longer than the time limit allowed by policy.
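
What "resume from a checkpoint" can look like is sketched below as a toy job.sh. Everything in it (the checkpoint.txt file, the step counts, the run_chunk function) is made up for illustration; a real job would save and restore its own state.

```shell
#!/bin/bash
# Toy resumable job: each run advances a counter stored in a checkpoint file.
# All names and numbers here are illustrative assumptions.
CHECKPOINT=${CHECKPOINT:-checkpoint.txt}
TOTAL_STEPS=${TOTAL_STEPS:-9}
STEPS_PER_RUN=${STEPS_PER_RUN:-3}

run_chunk() {
    # Resume from the last saved step, or start from 0.
    step=0
    if [ -f "$CHECKPOINT" ]; then
        step=$(cat "$CHECKPOINT")
    fi

    # Work for at most STEPS_PER_RUN steps, saving progress after each one.
    end=$((step + STEPS_PER_RUN))
    if [ "$end" -gt "$TOTAL_STEPS" ]; then
        end=$TOTAL_STEPS
    fi
    while [ "$step" -lt "$end" ]; do
        step=$((step + 1))
        echo "$step" > "$CHECKPOINT"
    done
    echo "finished at step $step of $TOTAL_STEPS"
}

run_chunk
```

With these defaults, gridtk submit --repeat=3 job.sh would cover all 9 steps across the three chained jobs, each one picking up where the previous one stopped.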

Monitoring Jobs

While gridtk list and gridtk report are useful for checking the status of jobs, you can get more information about your jobs using squeue, scontrol, and sacct. Here are some useful commands:

  • Get information about a specific job: scontrol show job <slurm_job_id>
  • Get information about a completed or failed job: sacct -j <slurm_job_id>
  • See ALL your jobs: squeue --me
  • Cancel ALL your jobs: scancel --me
  • View current QOS policies:
    sacctmgr show qos format=Name%20,Priority,Flags%30,MaxWall,MaxTRESPU%20,MaxJobsPU,MaxSubmitPU,MaxTRESPA%25
    
  • Find out which accounts your username has access to:
    sacctmgr list associations
    # or
    sacctmgr -n -p list assoc where user=$USER | awk '-F|' '{print "   "$2}'
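
The Slurm commands above expect a Slurm job id, whereas GridTK works with local job ids. One way to bridge the two is to read the id from the gridtk list table. The sketch below assumes the table layout shown in this tutorial (two header lines, local id in the first column, Slurm id in the second); slurm_id_of is a hypothetical helper, not a GridTK command.

```shell
# slurm_id_of is a hypothetical helper (not part of GridTK).
# It reads a gridtk list table on stdin and prints the Slurm id
# of the given local job id, assuming the layout shown in this tutorial.
slurm_id_of() {
    awk -v id="$1" 'NR > 2 && $1 == id { print $2 }'
}

# Demo on the sample table from this tutorial (no cluster needed):
slurm_id_of 1 <<'EOF'
  job-id    slurm-id  nodes    state        job-name    output                  dependencies    command
--------  ----------  -------  -----------  ----------  ----------------------  --------------  --------------------
       1      136140  None     PENDING (0)  gridtk      logs/gridtk.136140.out                  gridtk submit job.sh
EOF
# prints 136140

# On a real cluster (commented out because it needs Slurm):
#   scontrol show job "$(gridtk list | slurm_id_of 1)"
```

If the table layout changes in a future GridTK release, the column numbers in the awk program would need adjusting.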
    

Tab Completion

GridTK supports tab completion for the gridtk command. To enable it, add the following line to your ~/.bashrc file:

eval "$(_GRIDTK_COMPLETE=bash_source gridtk)"

or for zsh add the following line to your ~/.zshrc file:

eval "$(_GRIDTK_COMPLETE=zsh_source gridtk)"
