qlat.utils_io — Lock Management, Shutdown, and Time-Limit Utilities

Source: qlat/qlat/utils_io.pyx

Note: Update this document when updating the source file.

Outline

  1. Overview

  2. Lock Management

  3. Graceful Shutdown

  4. Time and Stop Checks

  5. Examples


Overview

The qlat.utils_io module provides MPI-aware process coordination primitives for long-running lattice QCD simulations. It wraps C++ utilities for:

  • Lock management — acquire and release filesystem locks so that multiple MPI processes or jobs can coordinate access to shared resources (e.g., gauge configurations on disk).

  • Graceful shutdown (qquit) — clean all Python and C++ caches, then terminate the program in an orderly fashion.

  • Time-limit checking — poll wall-clock budgets or SLURM job end-times to allow orderly exit before the scheduler kills the job.

  • Stop-file checking — poll for the existence of a sentinel file that signals the simulation should halt.

All functions in this module are available under the qlat (q) namespace after import qlat as q.


Lock Management

obtained_lock_history_list

A module-level Python list that records every path for which obtain_lock returned True. This provides an audit trail of locks acquired during the lifetime of the process and is used internally by release_lock and qquit.


obtain_lock(path: str) -> bool

Try to acquire a filesystem lock at path. Decorated with @q.timer.

Parameter

Type

Description

path

str

Filesystem path for the lock

Returns

bool

True if the lock was successfully acquired

If the lock is acquired, path is appended to obtained_lock_history_list. The underlying C++ implementation (cc.obtain_lock) uses filesystem-level locking (typically mkdir atomics) that is safe across MPI ranks.

if q.obtain_lock("/scratch/run_001/lock"):
    # this process owns the lock
    do_work()
    q.release_lock()

release_lock()

Release the currently held lock. Decorated with @q.timer.

Calls the C++ cc.release_lock() which removes the lock file/directory created by obtain_lock.


Graceful Shutdown

qquit(msg: str)

Clean all Python-level caches (via q.clean_cache()), then call the C++ cc.qquit(msg) which clears all C+±level caches and terminates the program.

Parameter

Type

Description

msg

str

Message printed before termination

This is the recommended way to exit an qlat program when an unrecoverable error is detected or a time/stop condition is met.

q.qquit("finished all trajectories")

Time and Stop Checks

check_time_limit(budget: float = None) -> None

Check whether the simulation is approaching its time limit. Decorated with @q.timer.

Parameter

Type

Default

Description

budget

float or None

None

Time budget in seconds. If None, read from q.get_time_budget()

Returns

None

Returns None if time remains; calls qquit() to terminate if limit is reached

When budget is None, the value is read from environment variables (in order of priority):

Variable

Description

q_budget

Budget in seconds, e.g. export q_budget="$((1 * 60 * 60))" for 1 hour

q_end_time

Absolute Unix timestamp, e.g. export q_end_time="$(($(date +%s) + 12 * 60 * 60))" or export q_end_time="$SLURM_JOB_END_TIME"

# Check with explicit 30-minute budget (terminates via qquit if limit reached)
q.check_time_limit(30 * 60)

# Check using environment variable (terminates via qquit if limit reached)
q.check_time_limit()

check_stop(fn: str = "stop.txt") -> None

Check whether a sentinel file fn exists in the current working directory. Decorated with @q.timer.

Parameter

Type

Default

Description

fn

str

"stop.txt"

Filename to check

Returns

None

Returns None if file does not exist; calls qquit() to terminate if file is found

This provides a simple mechanism for an operator or batch script to signal a running simulation to stop gracefully by creating a file.

# In the main simulation loop
for traj in range(start_traj, max_traj):
    do_trajectory(traj)
    q.check_stop()       # terminates via qquit if stop file found
    q.check_time_limit() # terminates via qquit if limit reached

Examples

Lock-Based Coordination

import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

path = "/scratch/lattice_run/lock"
if q.obtain_lock(path):
    q.displayln_info("Lock acquired, performing critical section.")
    # ... critical section: read/write shared gauge config ...
    q.release_lock()
    q.displayln_info("Lock released.")
else:
    q.displayln_info("Could not acquire lock, skipping.")

q.end_with_mpi()

Main Loop with Time and Stop Checks

import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

max_traj = 1000
for traj in range(max_traj):
    q.displayln_info(f"Starting trajectory {traj}")
    # ... run trajectory ...

    q.check_stop()       # terminates via qquit if stop file found
    q.check_time_limit() # terminates via qquit if limit reached

q.displayln_info("CHECK: finished successfully.")
q.end_with_mpi()

SLURM-Aware Time Management

# In your SLURM submission script:
export q_end_time="$SLURM_JOB_END_TIME"
export q_budget="$((30 * 60))"  # 30-minute warning margin

srun python3 run_simulation.py
import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

# check_time_limit() automatically reads q_end_time and q_budget
# from the environment (terminates via qquit if limit reached)
q.check_time_limit()

q.end_with_mpi()