`qlat.utils_io` — Lock Management, Shutdown, and Time-Limit Utilities¶

Source: qlat/qlat/utils_io.pyx

Note: Update this document when updating the source file.

Outline¶

Overview
Lock Management
Graceful Shutdown
Time and Stop Checks
Examples

Overview¶

The qlat.utils_io module provides MPI-aware process coordination primitives for long-running lattice QCD simulations. It wraps C++ utilities for:

Lock management — acquire and release filesystem locks so that multiple MPI processes or jobs can coordinate access to shared resources (e.g., gauge configurations on disk).
Graceful shutdown (qquit) — clean all Python and C++ caches, then terminate the program in an orderly fashion.
Time-limit checking — poll wall-clock budgets or SLURM job end-times to allow orderly exit before the scheduler kills the job.
Stop-file checking — poll for the existence of a sentinel file that signals the simulation should halt.

All functions in this module are available under the qlat (q) namespace after import qlat as q.

Lock Management¶

`obtained_lock_history_list`¶

A module-level Python list that records every path for which obtain_lock returned True. This provides an audit trail of locks acquired during the lifetime of the process and is used internally by release_lock and qquit.

`obtain_lock(path: str) -> bool`¶

Try to acquire a filesystem lock at path. Decorated with @q.timer.

Parameter	Type	Description
`path`	`str`	Filesystem path for the lock
Returns	`bool`	`True` if the lock was successfully acquired

If the lock is acquired, path is appended to obtained_lock_history_list. The underlying C++ implementation (cc.obtain_lock) relies on filesystem-level locking (typically an atomic mkdir), which is safe across MPI ranks.

if q.obtain_lock("/scratch/run_001/lock"):
    # This process owns the lock; release it even if do_work() raises.
    try:
        do_work()  # placeholder for the work the lock protects
    finally:
        q.release_lock()
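The mkdir-based locking mentioned above can be illustrated in plain Python. This is a minimal sketch, not qlat's implementation: the names `try_mkdir_lock`, `drop_mkdir_lock`, and `lock_history` are hypothetical, it assumes a POSIX filesystem where directory creation is atomic, and it leaves out the MPI coordination that the C++ code handles.

```python
import os
import tempfile

lock_history = []  # hypothetical stand-in for obtained_lock_history_list


def try_mkdir_lock(path):
    """Return True if this call created the lock directory, else False."""
    try:
        os.mkdir(path)  # only one caller can succeed in creating the directory
    except FileExistsError:
        return False
    lock_history.append(path)
    return True


def drop_mkdir_lock(path):
    """Remove the lock directory so that another process can acquire it."""
    os.rmdir(path)


base = tempfile.mkdtemp()
lock_path = os.path.join(base, "lock")
first = try_mkdir_lock(lock_path)    # lock is free, so this succeeds
second = try_mkdir_lock(lock_path)   # lock is already held, so this fails
drop_mkdir_lock(lock_path)
third = try_mkdir_lock(lock_path)    # free again after the release
drop_mkdir_lock(lock_path)
os.rmdir(base)
```

Only successful acquisitions are recorded, so the history list ends up with two entries here, mirroring how obtained_lock_history_list is described.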

`release_lock()`¶

Release the currently held lock. Decorated with @q.timer.

Calls the C++ cc.release_lock() which removes the lock file/directory created by obtain_lock.
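Since release_lock() takes no path and acts on the currently held lock, it helps to pair every successful acquire with exactly one release. One way to enforce that is a small context manager. The helper `held_lock` below is hypothetical and not part of qlat; the acquire and release functions are passed in so the sketch runs without qlat installed.

```python
from contextlib import contextmanager


@contextmanager
def held_lock(obtain, release, path):
    """Yield True if the lock at `path` was acquired, releasing it on exit."""
    acquired = obtain(path)
    try:
        yield acquired
    finally:
        if acquired:
            release()  # runs even if the body raises


# Stand-ins for q.obtain_lock / q.release_lock.
events = []


def fake_obtain(path):
    events.append(("obtain", path))
    return True


def fake_release():
    events.append(("release",))


with held_lock(fake_obtain, fake_release, "/tmp/example-lock") as ok:
    if ok:
        events.append(("work",))
```

In a qlat program the call would presumably read `with held_lock(q.obtain_lock, q.release_lock, path) as ok:`, with the body guarded by `if ok:`.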

Graceful Shutdown¶

`qquit(msg: str)`¶

Clean all Python-level caches (via q.clean_cache()), then call the C++ cc.qquit(msg), which clears all C++-level caches and terminates the program.

Parameter	Type	Description
`msg`	`str`	Message printed before termination

This is the recommended way to exit a qlat program when an unrecoverable error is detected or a time/stop condition is met.

q.qquit("finished all trajectories")
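For the unrecoverable-error case, a thin wrapper can route failures into the quit function. The sketch below is illustrative only: `run_or_quit` is a hypothetical helper, and the quit function is injected so the example runs without qlat. It also does not address how all MPI ranks would agree to quit together.

```python
def run_or_quit(step, quit_fn):
    """Run `step()`; on any exception, hand a message to `quit_fn`."""
    try:
        return step()
    except Exception as exc:
        quit_fn(f"unrecoverable error: {exc!r}")


# Stand-ins so the sketch runs without qlat.
messages = []


def failing_step():
    raise RuntimeError("config file is corrupt")


ok_value = run_or_quit(lambda: 42, messages.append)  # step succeeds
run_or_quit(failing_step, messages.append)           # step fails, message recorded
```

With `q.qquit` passed as `quit_fn`, the program would terminate inside the `except` branch instead of returning.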

Time and Stop Checks¶

`check_time_limit(budget: float = None) -> None`¶

Check whether the simulation is approaching its time limit. Decorated with @q.timer.

Parameter	Type	Default	Description
`budget`	`float` or `None`	`None`	Time budget in seconds. If `None`, read from `q.get_time_budget()`
Returns	`None`		Returns `None` if time remains; calls `qquit()` to terminate if limit is reached

When budget is None, the value is read from environment variables (in order of priority):

Variable	Description
`q_budget`	Budget in seconds, e.g. `export q_budget="$((1 * 60 * 60))"` for 1 hour
`q_end_time`	Absolute Unix timestamp, e.g. `export q_end_time="$(($(date +%s) + 12 * 60 * 60))"` or `export q_end_time="$SLURM_JOB_END_TIME"`

# Check with explicit 30-minute budget (terminates via qquit if limit reached)
q.check_time_limit(30 * 60)

# Check using environment variable (terminates via qquit if limit reached)
q.check_time_limit()
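How qlat combines the two variables is not spelled out here. The sketch below shows one plausible reading, assuming q_budget is the margin that must remain before q_end_time (as the "warning margin" comment in the SLURM example suggests). The function `too_little_time_left` and the zero default margin are assumptions for illustration, not qlat behavior.

```python
import time


def too_little_time_left(env, now=None):
    """Return True when fewer than `q_budget` seconds remain before `q_end_time`."""
    if now is None:
        now = time.time()
    end_time = env.get("q_end_time")
    if end_time is None:
        return False  # no deadline known, so there is nothing to compare against
    budget = float(env.get("q_budget", 0.0))  # assumed default margin: 0 seconds
    remaining = float(end_time) - now
    return remaining < budget


env = {"q_end_time": "1000", "q_budget": "100"}
early = too_little_time_left(env, now=850)  # 150 s remain, margin is 100 s
late = too_little_time_left(env, now=950)   # 50 s remain, margin is 100 s
```

Passing `os.environ` as `env` would apply the same check to the variables exported by a job script.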

`check_stop(fn: str = "stop.txt") -> None`¶

Check whether a sentinel file fn exists in the current working directory. Decorated with @q.timer.

Parameter	Type	Default	Description
`fn`	`str`	`"stop.txt"`	Filename to check
Returns	`None`		Returns `None` if file does not exist; calls `qquit()` to terminate if file is found

This provides a simple mechanism for an operator or batch script to signal a running simulation to stop gracefully by creating a file.

# In the main simulation loop
for traj in range(start_traj, max_traj):
    do_trajectory(traj)
    q.check_stop()       # terminates via qquit if stop file found
    q.check_time_limit() # terminates via qquit if limit reached
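The sentinel-file test itself is easy to reproduce in plain Python, which can be handy for dry-running a job script without qlat. This is a minimal sketch: `stop_requested` is a hypothetical name, and unlike check_stop it only reports the result rather than terminating.

```python
import os
import tempfile


def stop_requested(fn="stop.txt", directory="."):
    """Return True if the sentinel file exists in `directory`."""
    return os.path.isfile(os.path.join(directory, fn))


workdir = tempfile.mkdtemp()
sentinel = os.path.join(workdir, "stop.txt")
before = stop_requested(directory=workdir)  # no sentinel yet
open(sentinel, "w").close()                 # the operator creates the file
after = stop_requested(directory=workdir)
os.remove(sentinel)
os.rmdir(workdir)
```

Because the check is based purely on the file's existence, a leftover stop file should be deleted before resubmitting; otherwise the next run would stop at its first check.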

Examples¶

Lock-Based Coordination¶

import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

path = "/scratch/lattice_run/lock"
if q.obtain_lock(path):
    q.displayln_info("Lock acquired, performing critical section.")
    # ... critical section: read/write shared gauge config ...
    q.release_lock()
    q.displayln_info("Lock released.")
else:
    q.displayln_info("Could not acquire lock, skipping.")

q.end_with_mpi()
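The same acquire-or-skip pattern extends to several independent tasks, so that concurrent jobs each pick up work nobody else has claimed. The sketch below assumes one lock path per task; `run_unclaimed` and the lock paths are hypothetical, and the lock functions are injected so it runs without qlat.

```python
def run_unclaimed(tasks, obtain, release, work):
    """Run `work(name)` for every task whose lock can be acquired."""
    completed = []
    for name, lock_path in tasks:
        if not obtain(lock_path):
            continue  # another job owns this task; move on
        try:
            work(name)
            completed.append(name)
        finally:
            release()
    return completed


# Stand-ins so the sketch runs without qlat: one lock is already taken.
taken = {"locks/traj-1"}
released = []


def fake_obtain(path):
    if path in taken:
        return False
    taken.add(path)
    return True


def fake_release():
    released.append(True)


tasks = [(f"traj-{i}", f"locks/traj-{i}") for i in range(3)]
result = run_unclaimed(tasks, fake_obtain, fake_release, work=lambda name: None)
```

In a qlat program, `q.obtain_lock` and `q.release_lock` would take the place of the two stand-ins.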

Main Loop with Time and Stop Checks¶

import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

max_traj = 1000
for traj in range(max_traj):
    q.displayln_info(f"Starting trajectory {traj}")
    # ... run trajectory ...

    q.check_stop()       # terminates via qquit if stop file found
    q.check_time_limit() # terminates via qquit if limit reached

q.displayln_info("CHECK: finished successfully.")
q.end_with_mpi()

SLURM-Aware Time Management¶

# In your SLURM submission script:
export q_end_time="$SLURM_JOB_END_TIME"
export q_budget="$((30 * 60))"  # 30-minute warning margin

srun python3 run_simulation.py

# run_simulation.py
import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

# check_time_limit() automatically reads q_end_time and q_budget
# from the environment (terminates via qquit if limit reached)
q.check_time_limit()

q.end_with_mpi()
