fix: prevent race condition when copying NFS cache to local (!152) · Merge requests · hive / Common CI Configuration

Problem

When multiple CI jobs start simultaneously and need the same cache, they race to copy it from NFS to the local cache. The first job starts copying, and the destination file exists as soon as the copy begins, while it is still incomplete. Other jobs see this partial file, treat it as a usable cache, and try to extract from it, which fails with tar: short read errors.

Observed in: reputation_tracker pipeline #146323

regression-test and performance-test failed with tar: short read errors (they saw partial files of 0.24GB and 1.19GB)
setup-scripts-test succeeded (it was the first to copy and got the full 14.24GB file)

Solution

Use locking + atomic rename when copying NFS cache to local:

Exclusive lock while copying to local cache - other jobs wait
Atomic rename (.tmp → final) so file only appears when complete
Re-check after lock in case another job finished first
60s timeout with fallback to direct NFS extraction if lock can't be acquired
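
The four points above can be sketched roughly as follows. This is a minimal illustration, not the code from this merge request: the function names copy_cache_locked and extract_cache, the .lock and .tmp suffixes, and the exit code 75 are all assumptions made for the example. It relies on flock(1) from util-linux and on the temp file living in the same directory as the final file, so that mv is a same-filesystem rename.

```shell
#!/usr/bin/env bash
# Sketch only: function names, suffixes, and exit codes are hypothetical.

# Copy an archive from NFS into the local cache under an exclusive lock.
# Returns 0 if the complete local file exists afterwards, non-zero otherwise.
copy_cache_locked() {
  local nfs_file="$1"       # source archive on NFS
  local local_file="$2"     # destination in the local cache
  local timeout="${3:-60}"  # seconds to wait for the lock
  local lock_file="${local_file}.lock"
  local tmp_file="${local_file}.tmp"

  # Fast path: the final name only ever refers to a complete file.
  [ -f "$local_file" ] && return 0

  mkdir -p "$(dirname "$local_file")" || return 1

  (
    # Wait up to $timeout seconds for an exclusive lock on fd 9.
    flock -x -w "$timeout" 9 || exit 75

    # Re-check after acquiring the lock: another job may have finished
    # the copy while this one was waiting.
    [ -f "$local_file" ] && exit 0

    # Copy under a temporary name, then rename into place. The rename is
    # atomic because both names are in the same directory.
    cp "$nfs_file" "$tmp_file" && mv -f "$tmp_file" "$local_file"
  ) 9>"$lock_file"
}

# Extract from the local cache when possible, otherwise straight from NFS.
extract_cache() {
  local nfs_file="$1" local_file="$2" dest_dir="$3"
  if copy_cache_locked "$nfs_file" "$local_file"; then
    tar -xf "$local_file" -C "$dest_dir"
  else
    # Lock timeout or copy failure: fall back to direct NFS extraction.
    tar -xf "$nfs_file" -C "$dest_dir"
  fi
}
```

A job would then call something like extract_cache /path/on/nfs/cache.tar /path/to/local/cache.tar "$PWD" (paths here are placeholders). Keeping the lock file on the local disk rather than on NFS sidesteps the question of how reliable flock is on a given NFS mount.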

Testing

The fix can be tested by:

Updating reputation_tracker to use this branch
Re-running the failed pipeline
Expected result: all concurrent test jobs now either wait for the first copy to finish or use the already completed file
