Automatic Job Requeuing (Checkpointing)

For long-running jobs that need more time than the maximum walltime of a partition allows, you can implement a “self-requeuing” workflow. This allows a job to save its progress (checkpoint), exit gracefully before the time limit is reached, and automatically place itself back in the queue to resume execution.

This workflow relies on three components:

  1. Slurm Signals: Requesting that Slurm send a signal (e.g., SIGTERM) shortly before the time limit.

  2. Signal Handling: The application catching that signal, saving state, and exiting with a specific code (e.g., 99).

  3. Bash Logic: The submission script detecting that specific code and running scontrol requeue.

1. Submission Script Example

In your .sh file, you must request a signal using #SBATCH --signal and handle the exit code logic after the application finishes.

#!/bin/bash

#SBATCH --job-name=requeue_example
#SBATCH --output=logs/%x_%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mem=16G
#SBATCH --partition=general-gpu

# CRITICAL: Ask Slurm to send SIGTERM 60 seconds before the time limit is reached.
# Adjust the lead time to whatever your application needs to write its checkpoint.
#SBATCH --signal=TERM@60

# Load your environment
source my-env/bin/activate

echo "Job started on node: $SLURM_NODELIST"

# Run the application
# Note: The application must catch SIGTERM and exit with code 99 to trigger a requeue
srun python main.py
EXIT_CODE=$?

echo "Program exited with code: $EXIT_CODE"

# Check exit codes to determine next steps
if [ $EXIT_CODE -eq 0 ]; then
    echo "Success: Job completed."
    exit 0
elif [ $EXIT_CODE -eq 99 ]; then
    echo "Checkpoint triggered (Exit Code 99). Requeuing job..."
    
    # Requeue this specific job ID to run again
    # The job retains its Job ID but increments its restart count
    scontrol requeue $SLURM_JOB_ID
else
    echo "Failure: Job crashed with generic error."
    exit $EXIT_CODE
fi
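
Submit the script with sbatch as usual. Because a requeued job keeps its Job ID, a quick way to confirm the cycle is working is to watch the Restarts field in the output of scontrol show job <jobid>, which should increase by one after each requeue. You can also exercise the handler before the time limit is reached by sending the signal manually, for example with scancel --signal=TERM <jobid>.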

2. Application Code Example (Python)

Your application needs to listen for the signal sent by Slurm. When received, it should save its state and exit with the designated “requeue” code (99 in this example).

import signal
import sys
import time

def signal_handler(signum, frame):
    """
    Catches the Slurm signal (SIGTERM) before the job times out.
    """
    print(f"Signal {signum} received! Time limit approaching.")
    
    # 1. Save your progress to disk (checkpoint)
    # save_checkpoint()
    
    print("Checkpoint saved. Exiting with code 99 to trigger requeue.")
    
    # 2. Exit with the specific code expected by the bash script
    sys.exit(99) 

def main():
    # Register the signal handler for SIGTERM
    signal.signal(signal.SIGTERM, signal_handler)

    print("Starting calculation...")
    
    # Check for existing checkpoints here to resume work
    # load_checkpoint()

    # Main generic work loop
    for i in range(1, 10000):
        # ... perform work ...
        time.sleep(1)
        
        # If the job runs out of time during this loop, Slurm sends SIGTERM,
        # triggering the signal_handler defined above.

if __name__ == "__main__":
    main()
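
The save_checkpoint() and load_checkpoint() calls above are left as comments. Below is a minimal sketch of what they might look like, assuming the program state fits in a small Python dictionary pickled to a hypothetical checkpoint.pkl file in the working directory; adapt the contents and the path to your application:

import os
import pickle

CHECKPOINT_FILE = "checkpoint.pkl"  # hypothetical path; place it on storage that survives the requeue

def save_checkpoint(state):
    """Write the current state (e.g., the loop index) to disk."""
    # Write to a temporary file first, then rename, so an interrupted
    # write never clobbers the previous good checkpoint.
    tmp_file = CHECKPOINT_FILE + ".tmp"
    with open(tmp_file, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_file, CHECKPOINT_FILE)

def load_checkpoint():
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(CHECKPOINT_FILE):
        return None
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

Because signal_handler() only receives the signal number and a stack frame, the state it saves is typically kept in a module-level variable that the main loop updates; main() would then call load_checkpoint() before the loop to decide which iteration to resume from.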

Key Configuration Details

  • #SBATCH --signal=TERM@60: This directive tells Slurm to send SIGTERM to your job's processes 60 seconds before the walltime limit is hit. Adjust the lead time (e.g., @120) to give your application enough time to write its checkpoint file.

  • scontrol requeue $SLURM_JOB_ID: This command places the running job back into the PENDING state. The job keeps the same Job ID, and its restart count is incremented (see the sketch after this list). A job submitted with --no-requeue cannot be requeued this way.

  • Exit Codes: While 99 is used in this example, you can use any otherwise-unused exit code between 1 and 255. Codes 126 and above carry conventional shell meanings (126-127 for command errors, 128+N for termination by signal N), so a value below 126 is the safest choice. Ensure your Bash script checks for the exact code your application exits with.
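
If your application needs to distinguish a fresh submission from a requeued attempt (for logging, or to decide whether a checkpoint file should already exist), Slurm exposes the restart count in the job's environment. A minimal sketch using the SLURM_RESTART_COUNT environment variable, which Slurm only sets once a job has been restarted at least once:

import os

# SLURM_RESTART_COUNT is absent on the first attempt, so default it to 0.
restart_count = int(os.environ.get("SLURM_RESTART_COUNT", "0"))

if restart_count == 0:
    print("First attempt: starting from scratch.")
else:
    print(f"Requeued attempt #{restart_count}: expecting a checkpoint on disk.")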