fix: prevent race condition when copying NFS cache to local

Problem

When multiple CI jobs start simultaneously and need the same cache, they race to copy it from NFS to the local cache. The first job starts copying, and the destination file exists as soon as the copy begins, before it is complete. Other jobs see this partial file, treat it as a finished cache, and try to extract from it, which fails with tar: short read errors.

Observed in: reputation_tracker pipeline #146323

  • regression-test and performance-test failed with tar short read (saw 0.24GB and 1.19GB partial files)
  • setup-scripts-test succeeded (was first to copy, got full 14.24GB)

Solution

Use locking + atomic rename when copying NFS cache to local:

  1. Exclusive lock while copying to local cache - other jobs wait
  2. Atomic rename (.tmp → final) so file only appears when complete
  3. Re-check for the local file after acquiring the lock, in case another job finished the copy first
  4. 60s timeout with fallback to direct NFS extraction if lock can't be acquired
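
The four steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code in this branch: the function name copy_cache_to_local, the .lock and .tmp suffixes, and the polling interval are assumptions made for the example, and it relies on flock, so it assumes a POSIX system with the local cache on a local filesystem.

```python
import fcntl
import os
import shutil
import time


def copy_cache_to_local(nfs_path, local_path, lock_timeout=60.0, poll_interval=0.5):
    """Return the path to extract from (hypothetical helper).

    Returns local_path once a complete local copy exists, or nfs_path
    if the lock could not be acquired within lock_timeout seconds.
    """
    # Fast path: a complete local copy already exists.
    if os.path.exists(local_path):
        return local_path

    lock_path = local_path + ".lock"  # assumed naming
    tmp_path = local_path + ".tmp"
    deadline = time.monotonic() + lock_timeout

    with open(lock_path, "w") as lock_file:
        # Step 1 and 4: poll for an exclusive lock until the deadline passes.
        while True:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    # Fallback: the caller extracts directly from NFS.
                    return nfs_path
                time.sleep(poll_interval)

        try:
            # Step 3: another job may have finished the copy while we waited.
            if not os.path.exists(local_path):
                shutil.copyfile(nfs_path, tmp_path)
                # Step 2: os.replace is atomic when both paths are on the
                # same filesystem, so readers never see a partial file.
                os.replace(tmp_path, local_path)
            return local_path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

In this sketch, a job that times out never touches the local file, so it cannot read a partial copy. It only pays the cost of extracting over NFS.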

Testing

The fix can be tested by:

  1. Updating reputation_tracker to use this branch
  2. Re-running the failed pipeline
  3. Confirming that all concurrent test jobs either wait for the first copy or use the completed file
