Commit 484dc3a1 authored by Dan Notestein

fix: prevent race condition when copying NFS cache to local

When multiple CI jobs start simultaneously and need the same cache,
each one checks for the local tar file and, if it is missing, checks
NFS. The first job to find the NFS cache starts copying it to the
local path, so the destination file exists immediately but stays
incomplete until the copy finishes. Other jobs see this partial file
and try to extract from it, causing "tar: short read".

Fix by using:
1. Exclusive lock while copying to local cache
2. Atomic rename (.tmp -> final) so file only appears when complete
3. Re-check after acquiring lock in case another job finished first
4. 60s timeout with fallback to direct NFS extraction
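
The four steps above can be sketched as follows. This is a minimal
illustration in Python, not the pipeline's actual script: the names
fetch_cache and acquire_lock, the ".lock" and ".tmp" suffixes, and the
polling interval are assumptions made for the example. Only the 60s
timeout and the overall sequence come from this commit message.

```python
import fcntl
import os
import shutil
import tarfile
import time
from pathlib import Path

LOCK_TIMEOUT_S = 60  # timeout named in the commit message


def acquire_lock(lock_file, timeout_s, poll_s=0.5):
    """Poll for an exclusive flock; return False once timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


def fetch_cache(nfs_tar, local_tar, dest, timeout_s=LOCK_TIMEOUT_S):
    """Extract the cache into dest and return the tar path that was used.

    All names and paths here are hypothetical.
    """
    nfs_tar, local_tar, dest = Path(nfs_tar), Path(local_tar), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    source = nfs_tar  # step 4 fallback: extract directly from NFS
    if local_tar.exists():
        source = local_tar
    else:
        local_tar.parent.mkdir(parents=True, exist_ok=True)
        with open(str(local_tar) + ".lock", "w") as lock_file:
            if acquire_lock(lock_file, timeout_s):  # step 1
                try:
                    # Step 3: another job may have finished while we waited.
                    if not local_tar.exists():
                        tmp = Path(str(local_tar) + ".tmp")
                        shutil.copyfile(nfs_tar, tmp)
                        # Step 2: the final name appears only when complete.
                        os.replace(tmp, local_tar)
                    source = local_tar
                finally:
                    fcntl.flock(lock_file, fcntl.LOCK_UN)
    with tarfile.open(source) as tar:
        tar.extractall(dest)
    return source
```

The rename is atomic only because the ".tmp" file sits in the same
directory, and therefore on the same filesystem, as the final path.
Readers that check for the final name never observe a partial file,
even if they skip the lock entirely.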

This was observed in reputation_tracker pipeline #146323 where
regression-test and performance-test failed with tar short read
while setup-scripts-test succeeded (it was first to copy).
parent a7533dcc