qlat.utils_io: Lock Management, Shutdown, and Time-Limit Utilities
Source: qlat/qlat/utils_io.pyx
Note: Update this document when updating the source file.
Overview
The qlat.utils_io module provides MPI-aware process coordination primitives
for long-running lattice QCD simulations. It wraps C++ utilities for:
Lock management: acquire and release filesystem locks so that multiple MPI processes or jobs can coordinate access to shared resources (e.g., gauge configurations on disk).
Graceful shutdown (qquit): clean all Python and C++ caches, then terminate the program in an orderly fashion.
Time-limit checking: poll wall-clock budgets or SLURM job end-times to allow an orderly exit before the scheduler kills the job.
Stop-file checking: poll for the existence of a sentinel file that signals the simulation should halt.
All functions in this module are available under the qlat (q) namespace
after import qlat as q.
Lock Management
obtained_lock_history_list
A module-level Python list that records every path for which
obtain_lock returned True. This provides an audit trail of locks
acquired during the lifetime of the process and is used internally by
release_lock and qquit.
obtain_lock(path: str) -> bool
Try to acquire a filesystem lock at path. Decorated with @q.timer.
| Parameter | Type | Description |
|---|---|---|
| path | str | Filesystem path for the lock |
| Returns | bool | True if the lock was acquired, False otherwise |
If the lock is acquired, path is appended to obtained_lock_history_list.
The underlying C++ implementation (cc.obtain_lock) uses filesystem-level
locking (typically mkdir atomics) that is safe across MPI ranks.
```python
if q.obtain_lock("/scratch/run_001/lock"):
    # this process owns the lock
    do_work()
    q.release_lock()
```
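To illustrate the atomic-mkdir idea mentioned above, here is a minimal pure-Python sketch. It is not the cc.obtain_lock implementation: the helper names try_mkdir_lock and release_mkdir_lock are hypothetical, and the sketch relies only on the fact that os.mkdir either creates the directory or raises FileExistsError.

```python
import os
import tempfile


def try_mkdir_lock(path):
    """Hypothetical helper: return True if this call created the lock directory."""
    try:
        os.mkdir(path)  # raises FileExistsError if the directory already exists
        return True
    except FileExistsError:
        return False


def release_mkdir_lock(path):
    """Hypothetical helper: remove the lock directory created by try_mkdir_lock."""
    os.rmdir(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "lock")
        print(try_mkdir_lock(lock_path))  # first attempt creates the directory
        print(try_mkdir_lock(lock_path))  # second attempt finds it already taken
        release_mkdir_lock(lock_path)
```

Because creating the directory and testing for its existence happen in a single call, two competing callers cannot both see success on the same path.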
release_lock()
Release the currently held lock. Decorated with @q.timer.
Calls the C++ cc.release_lock() which removes the lock file/directory
created by obtain_lock.
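Pairing every successful obtain_lock with a release_lock is easy to get wrong when the critical section can raise an exception. One possible pattern, sketched here under assumptions, is a small context manager. The name held_lock is hypothetical and not part of qlat; the acquire and release callables are passed in so the sketch runs without qlat installed.

```python
from contextlib import contextmanager


@contextmanager
def held_lock(obtain, release, path):
    """Hypothetical wrapper: yield whether the lock was obtained, release on exit.

    obtain  -- callable taking a path and returning bool (e.g. q.obtain_lock)
    release -- callable taking no arguments (e.g. q.release_lock)
    """
    acquired = obtain(path)
    try:
        yield acquired
    finally:
        if acquired:
            release()  # runs even if the body raised an exception


# Assumed usage with qlat, based on the signatures documented above:
#
#     with held_lock(q.obtain_lock, q.release_lock, path) as ok:
#         if ok:
#             do_work()
```

The lock is released only when it was actually acquired, so a failed attempt does not disturb a lock held by another job.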
Graceful Shutdown
qquit(msg: str)
Clean all Python-level caches (via q.clean_cache()), then call the C++
cc.qquit(msg) which clears all C++-level caches and terminates the program.
| Parameter | Type | Description |
|---|---|---|
| msg | str | Message printed before termination |
This is the recommended way to exit a qlat program when an unrecoverable error is detected or a time/stop condition is met.
```python
q.qquit("finished all trajectories")
```
Time and Stop Checks
check_time_limit(budget: float = None) -> None
Check whether the simulation is approaching its time limit. Decorated with
@q.timer.
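| Parameter | Type | Default | Description |
|---|---|---|---|
| budget | float | None | Time budget in seconds. If None, the value is read from environment variables |
| Returns | None | | Returns None if the limit has not been reached; otherwise the program terminates via qquit |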
When budget is None, the value is read from environment variables (in
order of priority):
| Variable | Description |
|---|---|
| q_budget | Budget in seconds |
| q_end_time | Absolute Unix timestamp |
```python
# Check with explicit 30-minute budget (terminates via qquit if limit reached)
q.check_time_limit(30 * 60)
# Check using environment variable (terminates via qquit if limit reached)
q.check_time_limit()
```
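The exact comparison performed by cc.check_time_limit is not spelled out here. As a rough sketch of one plausible reading, the function below stops when fewer than budget seconds remain before an absolute end time. The name should_stop and the comparison rule are assumptions for illustration, not the library's logic.

```python
import time


def should_stop(budget, end_time, now=None):
    """Assumed rule: stop when fewer than `budget` seconds remain before `end_time`.

    budget   -- seconds needed to finish the next unit of work and exit cleanly
    end_time -- absolute Unix timestamp at which the job is expected to be killed
    now      -- current Unix timestamp (defaults to time.time())
    """
    if now is None:
        now = time.time()
    return end_time - now < budget


if __name__ == "__main__":
    end_time = time.time() + 3600  # pretend the job ends in one hour
    print(should_stop(30 * 60, end_time))  # 30-minute budget
    print(should_stop(2 * 3600, end_time))  # 2-hour budget
```

Under this reading, the budget acts as a safety margin: the larger it is, the earlier the loop gives up its remaining wall-clock time.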
check_stop(fn: str = "stop.txt") -> None
Check whether a sentinel file fn exists in the current working directory.
Decorated with @q.timer.
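| Parameter | Type | Default | Description |
|---|---|---|---|
| fn | str | "stop.txt" | Filename to check |
| Returns | None | | Returns None if the file does not exist; otherwise the program terminates via qquit |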
This provides a simple mechanism for an operator or batch script to signal a running simulation to stop gracefully by creating a file.
```python
# In the main simulation loop
for traj in range(start_traj, max_traj):
    do_trajectory(traj)
    q.check_stop()  # terminates via qquit if stop file found
    q.check_time_limit()  # terminates via qquit if limit reached
```
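The sentinel-file test itself amounts to an existence check. A minimal sketch follows, assuming the file is looked up relative to a directory; the name stop_requested and its directory parameter are hypothetical and not part of qlat.

```python
import os


def stop_requested(fn="stop.txt", directory="."):
    """Hypothetical helper: True if the sentinel file exists in `directory`."""
    return os.path.exists(os.path.join(directory, fn))


if __name__ == "__main__":
    # Reports whether a stop.txt file is present in the current directory.
    print(stop_requested())
```

An operator would then request a stop by creating the file in the run directory, for example with `touch stop.txt`.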
Examples
Lock-Based Coordination
```python
import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

path = "/scratch/lattice_run/lock"
if q.obtain_lock(path):
    q.displayln_info("Lock acquired, performing critical section.")
    # ... critical section: read/write shared gauge config ...
    q.release_lock()
    q.displayln_info("Lock released.")
else:
    q.displayln_info("Could not acquire lock, skipping.")

q.end_with_mpi()
```
Main Loop with Time and Stop Checks
```python
import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

max_traj = 1000
for traj in range(max_traj):
    q.displayln_info(f"Starting trajectory {traj}")
    # ... run trajectory ...
    q.check_stop()  # terminates via qquit if stop file found
    q.check_time_limit()  # terminates via qquit if limit reached

q.displayln_info("CHECK: finished successfully.")
q.end_with_mpi()
```
SLURM-Aware Time Management
```shell
# In your SLURM submission script:
export q_end_time="$SLURM_JOB_END_TIME"
export q_budget="$((30 * 60))"  # 30-minute warning margin
srun python3 run_simulation.py
```
```python
import qlat as q

size_node_list = [[1, 1, 1, 1]]
q.begin_with_mpi(size_node_list)

# check_time_limit() automatically reads q_end_time and q_budget
# from the environment (terminates via qquit if limit reached)
q.check_time_limit()

q.end_with_mpi()
```
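To check what a job will see before launching it, the two variables can be inspected from plain Python. The sketch below is an illustration only: read_time_settings is a hypothetical helper, and returning None for an unset variable is an assumption of this sketch, not a statement about the defaults qlat applies.

```python
import os


def read_time_settings(env=None):
    """Hypothetical helper: return (budget, end_time) as floats, or None when unset."""
    if env is None:
        env = os.environ

    def as_float(name):
        value = env.get(name, "")
        # Treat a missing or blank variable as "not set".
        return float(value) if value.strip() else None

    return as_float("q_budget"), as_float("q_end_time")


if __name__ == "__main__":
    budget, end_time = read_time_settings()
    print("q_budget:", budget)
    print("q_end_time:", end_time)
```

Passing a dictionary instead of relying on os.environ makes the helper easy to exercise in a unit test.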