fix: prevent race condition when copying NFS cache to local
Problem
When multiple CI jobs start simultaneously and need the same cache, they race to copy it from NFS to the local cache. The first job starts copying, and the destination file appears immediately, before the copy is complete. Other jobs see this partial file and try to extract from it, causing tar: short read errors.
Observed in: reputation_tracker pipeline #146323
- regression-test and performance-test failed with tar short read (saw 0.24GB and 1.19GB partial files)
- setup-scripts-test succeeded (was first to copy, got full 14.24GB)
Solution
Use locking + atomic rename when copying NFS cache to local:
- Exclusive lock while copying to local cache - other jobs wait
- Atomic rename (.tmp → final) so file only appears when complete
- Re-check after lock in case another job finished first
- 60s timeout with fallback to direct NFS extraction if lock can't be acquired
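
The four steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in this branch: the function name `copy_to_local_cache`, the `.lock` suffix, and the polling interval are assumptions made for the example, and it relies on POSIX `flock` with the lock file living on the local filesystem.

```python
import fcntl
import os
import shutil
import time


def copy_to_local_cache(nfs_path, local_path, timeout=60.0, poll=0.5):
    """Return the path a job should extract from (hypothetical helper)."""
    # Fast path: a complete local copy already exists.
    if os.path.exists(local_path):
        return local_path

    lock_path = local_path + ".lock"  # assumed naming, for illustration
    tmp_path = local_path + ".tmp"
    deadline = time.monotonic() + timeout

    with open(lock_path, "w") as lock_file:
        # Poll for the exclusive lock so the wait can time out.
        while True:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    # Fallback: extract directly from NFS.
                    return nfs_path
                time.sleep(poll)
        try:
            # Re-check: another job may have finished while we waited.
            if not os.path.exists(local_path):
                shutil.copyfile(nfs_path, tmp_path)
                # Atomic rename: the final name only appears once complete.
                os.replace(tmp_path, local_path)
            return local_path
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because the copy is written to the .tmp name and renamed only when finished, a job that checks for the final file never sees a partial archive, even if it skips the lock on the fast path.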
Testing
The fix can be tested by:
- Updating reputation_tracker to use this branch
- Re-running the failed pipeline
- Confirming that all concurrent test jobs either wait for the first copy or use the completed file