From 0990a5541aa1ce3c5328781eea801d35ba71a029 Mon Sep 17 00:00:00 2001
From: Dan Notestein
Date: Sun, 28 Dec 2025 18:06:49 -0500
Subject: [PATCH 1/4] Refactor cache-manager to use immutable tar-based local cache

- Store local cache as tar files instead of directories/symlinks
- Local-first PUT: create local tar, then push to NFS
- Always extract fresh on GET (prevents cache corruption from job modifications)
- Unified tar format for both NFS host and clients
- Remove legacy directory format support
- Add cache-manager.md documentation
---
 docs/cache-manager.md    | 358 ++++++++++++++++++++++++++++++++++++
 scripts/cache-manager.sh | 379 +++++++++++++++++-----------------
 2 files changed, 520 insertions(+), 217 deletions(-)
 create mode 100644 docs/cache-manager.md

diff --git a/docs/cache-manager.md b/docs/cache-manager.md
new file mode 100644
index 0000000..108efc1
--- /dev/null
+++ b/docs/cache-manager.md
@@ -0,0 +1,358 @@
# cache-manager.sh

Centralized CI cache manager with NFS backing for cross-builder cache sharing.

## Overview

The cache-manager provides a shared caching layer for CI jobs across multiple build servers. It uses NFS as the shared storage backend with local caching for performance, and implements LRU eviction for space management.

**Primary use cases:**
- HAF replay data caching (PostgreSQL data directories)
- Hive replay data caching
- Downstream project caches (hivemind, balance_tracker, etc.)

**Storage format:** All caches are stored as tar archives (`.tar`) for consistent behavior and optimal NFS performance.

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          NFS Server                              │
│                       (hive-builder-10)                          │
│  /storage1/ci-cache/  ←──symlink──  /nfs/ci-cache/               │
│  ├── hive/                                                       │
│  │   └── <key>.tar                                               │
│  ├── haf/                                                        │
│  │   └── <key>.tar                                               │
│  ├── .lru_index                                                  │
│  └── .global_lock                                                │
└─────────────────────────────────────────────────────────────────┘
                              │
                          NFS mount
                              │
┌─────────────────────────────┼─────────────────────────────────┐
│  hive-builder-8             │  hive-builder-9                 │
│  /nfs/ci-cache/ (mount)     │  /nfs/ci-cache/ (mount)         │
│  /cache/ (local SSD)        │  /cache/ (local SSD)            │
│  └── hive_<key>.tar         │  └── haf_<key>.tar              │
└─────────────────────────────┴─────────────────────────────────┘
```

### Immutable Local Cache

Local caches are stored as tar files, not directories. This ensures:
- **Immutability**: Jobs cannot accidentally corrupt the cache by modifying extracted files
- **Clean extraction**: Each job gets a pristine copy of the data
- **Predictable behavior**: No risk of state leaking between jobs

### NFS Host vs NFS Client

The script automatically detects whether it's running on the NFS server or a client:

- **NFS Host** (hive-builder-10): `/nfs/ci-cache` is a symlink to local storage. No network I/O for cache operations.
- **NFS Clients**: `/nfs/ci-cache` is an NFS mount point. Cache reads/writes go over the network.

Both the NFS host and clients use the same tar archive format for consistency. On the NFS host, tar operations are local I/O (fast), while on clients they go over the network.

## Commands

### get

```bash
cache-manager.sh get <cache-type> <cache-key> <local-dest>
```

Retrieves a cache entry. Search order:
1. Local tar cache (`/cache/<cache-type>_<cache-key>.tar`)
2. NFS tar archive (`/nfs/ci-cache/<cache-type>/<cache-key>.tar`)

**Always extracts fresh**: Even on a local cache hit, the tar is extracted to `<local-dest>`. This ensures jobs get pristine data and cannot corrupt the cache.

On an NFS cache hit, the tar is also copied to the local cache for future use on the same builder.
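The lookup-and-extract flow, condensed from the `cmd_get` changes in `scripts/cache-manager.sh` below (a simplified sketch; locking, LRU updates, and HAF permission handling are omitted):

```bash
# Resolve the source tar: local cache first, then NFS, otherwise a miss.
if [[ -f "$LOCAL_TAR_FILE" ]]; then
    source_tar="$LOCAL_TAR_FILE"          # local hit
elif [[ -f "$NFS_TAR_FILE" ]]; then
    source_tar="$NFS_TAR_FILE"            # NFS hit
else
    exit 1                                # cache miss
fi

# Always extract a fresh copy so the job cannot corrupt the cached tar.
mkdir -p "$local_dest"
tar xf "$source_tar" -C "$local_dest"

# On an NFS hit, warm the local cache for the next job on this builder.
[[ "$source_tar" == "$NFS_TAR_FILE" ]] && cp "$NFS_TAR_FILE" "$LOCAL_TAR_FILE"
```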

**Locking:** Uses a shared lock (`flock -s`) during extraction so multiple jobs can read simultaneously.

### put

```bash
cache-manager.sh put <cache-type> <cache-key> <local-source>
```

Stores a cache entry to NFS as a tar archive.

**Locking:** Uses an exclusive lock (`flock -x`) to prevent concurrent writes.

**Behavior:**
1. Creates the local tar first (always succeeds if disk space is available)
2. Pushes the local tar to NFS (if available)
3. If NFS is unavailable or the push fails, the local cache still exists

This "local-first" approach ensures the builder always has a local cache after PUT, avoiding an NFS fetch if the next job lands on the same builder.

**Optimizations:**
- Skips if the cache already exists on NFS (but ensures a local copy exists)
- Uses atomic rename (`.tmp` → `.tar`) to prevent partial writes
- Excludes `datadir/blockchain` from hive/haf caches (jobs use the local block_log mount)

### cleanup

```bash
cache-manager.sh cleanup [--max-size-gb N] [--max-age-days N]
```

Removes old caches using LRU eviction. Triggered automatically when the cache reaches 90% capacity.

### list / status

```bash
cache-manager.sh list [cache-type]
cache-manager.sh status
```

Display cache contents and overall status.

### is-fast-builder

```bash
cache-manager.sh is-fast-builder
```

Returns 0 if running on a fast builder (AMD 5950/5900/EPYC). Used for job scheduling decisions.

## Locking Mechanism

### Lock Types

| Lock | File | Mode | Purpose |
|------|------|------|---------|
| Tar lock | `<key>.tar.lock` | Exclusive (`-x`) for PUT, shared (`-s`) for GET | Serialize tar archive access |
| Global lock | `.global_lock` | Exclusive | LRU index updates |

### Lock Holder Info

When acquiring a lock, the script writes debug info to `<lockfile>.info`:

```
hostname=hive-builder-8
pid=12345
started=2025-01-15T10:30:00+00:00
job_id=1234567
pipeline_id=98765
```

This helps diagnose stale locks.

### BusyBox Compatibility

Alpine-based images (docker-builder, docker-dind) may have BusyBox flock which lacks timeout support. The script detects this and falls back to a retry loop:

```bash
# GNU coreutils flock
flock -x -w 120 lockfile command

# BusyBox fallback (retry every 5s for up to timeout)
while [[ $elapsed -lt $timeout ]]; do
    flock -x -n lockfile command && return 0
    sleep 5
done
```

**Note:** util-linux must be installed in Alpine images for proper NFS flock support. BusyBox flock returns "Bad file descriptor" on NFS mounts.

## Handling Failures and Canceled Jobs

### Stale Lock Detection

Jobs can be canceled mid-operation, leaving lock files behind. The script handles this:

1. **Age check:** Lock files older than `CACHE_STALE_LOCK_MINUTES` (default: 10) are considered potentially stale

2. **Active lock check:** Tests whether the lock is actually held:
   ```bash
   flock -n "$lockfile" -c "true"  # Returns 0 if NOT held
   ```

3. **Stale lock cleanup:**
   - If the lock file is old AND not actively held: silently remove it
   - If the lock file is old AND still held: log a warning with holder info, then break it

```bash
_check_stale_lock() {
    # Lock older than threshold?
    if [[ $lock_age_minutes -lt $stale_minutes ]]; then
        return 1  # Not stale
    fi

    # Lock actually held?
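    # (flock -n succeeds only when no other process holds the lock, so a
    # zero exit below means the file is just leftover debris, matching the
    # "Returns 0 if NOT held" probe described above)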
+ if flock -n "$lockfile" -c "true"; then + rm -f "$lockfile" # Just a leftover file + return 0 + fi + + # Held but ancient - break it + _log "Breaking stale lock (${lock_age_minutes} min old)" + rm -f "$lockfile" "${lockfile}.info" + return 0 +} +``` + +### Incomplete Cache Entries + +**PUT operations use atomic rename:** + +```bash +tar cf '$NFS_TAR_FILE.tmp' -C '$local_source' . +mv '$NFS_TAR_FILE.tmp' '$NFS_TAR_FILE' # Atomic on POSIX +``` + +If a job is canceled during tar creation: +- The `.tmp` file remains (incomplete) +- The final `.tar` file doesn't exist +- Next job sees cache miss and creates a fresh cache +- Old `.tmp` files can be cleaned up manually or by periodic maintenance + +**Double-check after lock acquisition:** + +```bash +# Inside locked section +if [[ -f '$NFS_TAR_FILE' ]]; then + echo 'Cache was created while waiting for lock' + exit 0 +fi +``` + +This prevents duplicate work when multiple jobs race to create the same cache. + +### Cleanup During Operations + +The `cleanup` command skips entries that are currently locked: + +```bash +if ! flock -n "$lock_file" -c "true"; then + _log "Skipping $entry - currently locked" + continue +fi +``` + +## PostgreSQL Data Handling + +HAF caches contain PostgreSQL data directories which require special handling: + +### Permission Management + +PostgreSQL requires `pgdata` to be mode 700, owned by the postgres user (uid 105). + +**Before caching (relax permissions):** +```bash +sudo chmod -R a+rX "$pgdata_path" # Make readable for tar +``` + +**After extraction (restore permissions):** +```bash +sudo chmod 700 "$pgdata_path" +sudo chown -R 105:105 "$pgdata_path" +``` + +### Tablespace Symlink Handling + +PostgreSQL creates absolute symlinks in `pg_tblspc/`: +``` +pg_tblspc/16396 -> /home/hived/datadir/haf_db_store/tablespace +``` + +These break when data is extracted to a different location. The script converts them to relative paths: +``` +pg_tblspc/16396 -> ../../tablespace +``` + +### WAL File Preservation + +All WAL files are kept in the cache. Previously there was an attempt to exclude WAL files to save ~5.8GB, but this caused data corruption during PostgreSQL crash recovery. + +## Configuration + +| Variable | Default | Description | +|----------|---------|-------------| +| `CACHE_NFS_PATH` | `/nfs/ci-cache` | NFS mount point | +| `CACHE_LOCAL_PATH` | `/cache` | Local cache directory | +| `CACHE_MAX_SIZE_GB` | `2000` | Max total NFS cache size | +| `CACHE_MAX_AGE_DAYS` | `30` | Max cache age for eviction | +| `CACHE_LOCK_TIMEOUT` | `120` | Lock timeout in seconds | +| `CACHE_STALE_LOCK_MINUTES` | `10` | Break locks older than this | +| `CACHE_QUIET` | `false` | Suppress verbose output | + +## CI Integration + +### Required Tags + +Jobs using the cache-manager should use: +```yaml +tags: + - data-cache-storage # Has /nfs/ci-cache mounted + - fast # AMD 5950 builders (faster replays) +``` + +### Example Usage + +```yaml +prepare_haf_data: + image: registry.gitlab.syncad.com/hive/common-ci-configuration/docker-builder + script: + - | + # Try to get from cache + if cache-manager.sh get haf "$HAF_COMMIT" /data; then + echo "Cache hit!" + else + echo "Cache miss, running replay..." 
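        # run-replay.sh stands in for the project's actual replay entrypoint
        # (hypothetical here); the put below publishes the result so other
        # builders can skip the replay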
        ./run-replay.sh /data
        cache-manager.sh put haf "$HAF_COMMIT" /data
      fi
```

## Troubleshooting

### Stuck Locks

Check lock holder info:
```bash
cat /nfs/ci-cache/haf/<key>.tar.lock.info
```

Manually break a lock:
```bash
rm -f /nfs/ci-cache/haf/<key>.tar.lock{,.info}
```

### NFS Performance

Tar archives are used instead of directories because writing a single large file to NFS is ~3x faster than writing many small files:
- `cp -a` 19GB/1844 files: 74s
- `tar` single archive: 25s

### Cache Miss When Expected Hit

1. Check if the tar exists: `ls -la /nfs/ci-cache/<type>/<key>.tar`
2. Check lock status: `flock -n <lockfile> -c "echo unlocked" || echo "locked"`
3. Check the NFS mount: `mountpoint /nfs/ci-cache`

### Incomplete Caches

Look for `.tmp` files:
```bash
find /nfs/ci-cache -name "*.tmp" -mmin +60
```

These are from interrupted PUT operations and can be safely removed.

### Local Cache Management

Local caches are stored as tar files in `/cache/`:
```bash
# List local caches
ls -la /cache/*.tar

# Remove stale local caches (older than 7 days)
find /cache -name "*.tar" -mtime +7 -delete
```

Local caches are automatically populated when fetching from NFS and persist across jobs on the same builder.

diff --git a/scripts/cache-manager.sh b/scripts/cache-manager.sh
index 1011a76..a650c91 100755
--- a/scripts/cache-manager.sh
+++ b/scripts/cache-manager.sh
@@ -184,15 +184,17 @@ _get_paths() {
     local cache_key="$2"
 
     NFS_CACHE_DIR="${CACHE_NFS_PATH}/${cache_type}/${cache_key}"
+    NFS_TAR_FILE="${NFS_CACHE_DIR}.tar"
+    NFS_TAR_LOCK="${NFS_TAR_FILE}.lock"
 
-    # On NFS host, use NFS path as local path to avoid redundant copies
+    # Local cache is always a tar file (immutable, extracted fresh each time)
+    # On NFS host, local tar IS the NFS tar (same filesystem)
     if _is_nfs_host; then
-        LOCAL_CACHE_DIR="$NFS_CACHE_DIR"
+        LOCAL_TAR_FILE="$NFS_TAR_FILE"
     else
-        LOCAL_CACHE_DIR="${CACHE_LOCAL_PATH}/${cache_type}_${cache_key}"
+        LOCAL_TAR_FILE="${CACHE_LOCAL_PATH}/${cache_type}_${cache_key}.tar"
     fi
 
-    LOCK_FILE="${NFS_CACHE_DIR}/.lock"
     METADATA_FILE="${NFS_CACHE_DIR}/.metadata"
     LRU_INDEX="${CACHE_NFS_PATH}/.lru_index"
     GLOBAL_LOCK="${CACHE_NFS_PATH}/.global_lock"
@@ -383,7 +385,8 @@ _build_haf_tar_excludes() {
     echo "$excludes"
 }
 
-# GET: Check local, then NFS, copy to local if found on NFS
+# GET: Check local tar, then NFS tar, extract to destination
+# Local cache is immutable (tar file) - extracted fresh each time for safety
 cmd_get() {
     local cache_type="$1"
     local cache_key="$2"
@@ -394,104 +397,62 @@ cmd_get() {
     local is_nfs_host=false
     _is_nfs_host && is_nfs_host=true
 
-    # 1. 
Check local cache first (on NFS host, this IS the NFS cache) - if [[ -d "$LOCAL_CACHE_DIR" ]]; then - _log "Cache hit: $LOCAL_CACHE_DIR" - if [[ "$LOCAL_CACHE_DIR" != "$local_dest" ]]; then - _log "Copying to destination: $local_dest" - mkdir -p "$(dirname "$local_dest")" - # Use cp -r instead of cp -a to avoid permission issues on NFS - # (cp -a tries to preserve ownership which can fail on NFS) - cp -r "$LOCAL_CACHE_DIR" "$local_dest" - else - _log "Destination is cache dir, no copy needed" - fi - # Restore pgdata permissions for HAF caches - if [[ "$cache_type" == "haf" ]]; then - _restore_pgdata_permissions "$local_dest" - fi - # Update LRU if NFS available - if _nfs_available; then - _update_lru "$cache_type" "$cache_key" || true - fi - return 0 - fi + # Determine which tar file to use (local cache or NFS) + local source_tar="" - # On NFS host, local and NFS are the same - if local miss, it's a miss - if [[ "$is_nfs_host" == "true" ]]; then - _log "NFS host cache miss: $NFS_CACHE_DIR" + # 1. Check local tar cache first (on NFS host, this IS the NFS tar) + if [[ -f "$LOCAL_TAR_FILE" ]]; then + _log "Local cache hit: $LOCAL_TAR_FILE" + source_tar="$LOCAL_TAR_FILE" + elif [[ "$is_nfs_host" == "true" ]]; then + # On NFS host, local and NFS are the same - if local miss, it's a miss + _log "NFS host cache miss: $NFS_TAR_FILE" return 1 - fi - - # 2. Check NFS cache (only for NFS clients) - if ! _nfs_available; then + elif ! _nfs_available; then _log "NFS not available, cache miss" return 1 - fi - - # Check for tar archive first (new format), then directory (legacy format) - local NFS_TAR_FILE="${NFS_CACHE_DIR}.tar" - local use_tar=false - - if [[ -f "$NFS_TAR_FILE" ]]; then - use_tar=true - _log "NFS cache hit (tar archive): $NFS_TAR_FILE" - elif [[ -d "$NFS_CACHE_DIR" ]]; then - _log "NFS cache hit (directory): $NFS_CACHE_DIR" + elif [[ -f "$NFS_TAR_FILE" ]]; then + _log "NFS cache hit: $NFS_TAR_FILE" + source_tar="$NFS_TAR_FILE" else - _log "NFS cache miss: $NFS_CACHE_DIR (no tar or dir)" + _log "Cache miss: $NFS_TAR_FILE" return 1 fi - # 3. Copy from NFS to local - NFS clients only + # 2. 
Extract tar to destination (always extract fresh for safety) mkdir -p "$local_dest" - if [[ "$use_tar" == "true" ]]; then - # Extract tar archive to local (fast: reading single file from NFS) - local NFS_TAR_LOCK="${NFS_TAR_FILE}.lock" - touch "$NFS_TAR_LOCK" 2>/dev/null || true - - local get_start_time=$(date +%s.%N) - if _flock_with_timeout "$CACHE_LOCK_TIMEOUT" -s "$NFS_TAR_LOCK" -c " - lock_acquired=\$(date +%s.%N) - echo \"[cache-manager] Shared lock acquired in \$(echo \"\$lock_acquired - $get_start_time\" | bc)s\" >&2 - - tar_size=\$(stat -c %s '$NFS_TAR_FILE' 2>/dev/null || echo 0) - tar_size_gb=\$(echo \"scale=2; \$tar_size / 1024 / 1024 / 1024\" | bc) - echo \"[cache-manager] Extracting tar archive (\${tar_size_gb}GB) to local: $local_dest\" >&2 - - extract_start=\$(date +%s.%N) - tar xf '$NFS_TAR_FILE' -C '$local_dest' - extract_end=\$(date +%s.%N) - extract_duration=\$(echo \"\$extract_end - \$extract_start\" | bc) - throughput=\$(echo \"scale=2; \$tar_size / 1024 / 1024 / \$extract_duration\" | bc 2>/dev/null || echo '?') - echo \"[cache-manager] Extraction completed in \${extract_duration}s (\${throughput} MB/s)\" >&2 - "; then - _log "Extracted tar archive successfully" - else - _error "Failed to extract tar archive" - return 1 - fi - else - # Legacy directory format - use tar pipe for faster reads - mkdir -p "$(dirname "$LOCK_FILE")" - touch "$LOCK_FILE" + local tar_lock="${source_tar}.lock" + touch "$tar_lock" 2>/dev/null || true - if _flock_with_timeout "$CACHE_LOCK_TIMEOUT" -s "$LOCK_FILE" -c " - echo '[cache-manager] Copying from NFS directory to local: $local_dest' >&2 - (cd '$NFS_CACHE_DIR' && tar cf - .) | (cd '$local_dest' && tar xf -) - "; then - _log "Copied from directory successfully" - else - _error "Failed to acquire shared lock" - return 1 - fi + local get_start_time=$(date +%s.%N) + if _flock_with_timeout "$CACHE_LOCK_TIMEOUT" -s "$tar_lock" -c " + lock_acquired=\$(date +%s.%N) + echo \"[cache-manager] Shared lock acquired in \$(echo \"\$lock_acquired - $get_start_time\" | bc)s\" >&2 + + tar_size=\$(stat -c %s '$source_tar' 2>/dev/null || echo 0) + tar_size_gb=\$(echo \"scale=2; \$tar_size / 1024 / 1024 / 1024\" | bc) + echo \"[cache-manager] Extracting (\${tar_size_gb}GB) to: $local_dest\" >&2 + + extract_start=\$(date +%s.%N) + tar xf '$source_tar' -C '$local_dest' + extract_end=\$(date +%s.%N) + extract_duration=\$(echo \"\$extract_end - \$extract_start\" | bc) + throughput=\$(echo \"scale=2; \$tar_size / 1024 / 1024 / \$extract_duration\" | bc 2>/dev/null || echo '?') + echo \"[cache-manager] Extraction completed in \${extract_duration}s (\${throughput} MB/s)\" >&2 + "; then + _log "Extracted successfully" + else + _error "Failed to extract tar archive" + return 1 fi - # Cache locally for future use (symlink to avoid copy) - if [[ "$LOCAL_CACHE_DIR" != "$local_dest" && ! -e "$LOCAL_CACHE_DIR" ]]; then - mkdir -p "$(dirname "$LOCAL_CACHE_DIR")" - ln -sf "$local_dest" "$LOCAL_CACHE_DIR" 2>/dev/null || true + # 3. Copy NFS tar to local cache for future use (skip if already local or on NFS host) + if [[ "$source_tar" == "$NFS_TAR_FILE" && "$LOCAL_TAR_FILE" != "$NFS_TAR_FILE" && ! 
-f "$LOCAL_TAR_FILE" ]]; then + mkdir -p "$(dirname "$LOCAL_TAR_FILE")" + if cp "$NFS_TAR_FILE" "$LOCAL_TAR_FILE" 2>/dev/null; then + _log "Cached locally: $LOCAL_TAR_FILE" + fi fi # Restore pgdata permissions for HAF caches @@ -503,7 +464,7 @@ cmd_get() { return 0 } -# PUT: Copy local cache to NFS +# PUT: Store cache as tar archive (NFS primary, local as fallback) cmd_put() { local cache_type="$1" local cache_key="$2" @@ -524,111 +485,109 @@ cmd_put() { local is_nfs_host=false _is_nfs_host && is_nfs_host=true - # On NFS host, LOCAL_CACHE_DIR == NFS_CACHE_DIR, so one copy does both + # On NFS host, storage is local so no network I/O, but we still use tar format if [[ "$is_nfs_host" == "true" ]]; then # Check if already exists - if [[ -d "$NFS_CACHE_DIR" && -f "$METADATA_FILE" ]]; then + if [[ -f "$NFS_TAR_FILE" ]]; then _log "Cache already exists on NFS host, updating timestamp" _update_lru "$cache_type" "$cache_key" return 0 fi - # Copy directly to NFS path (which is local storage on this host) - # Use tar streaming for consistency (though local-to-local is already fast) - if [[ "$local_source" != "$NFS_CACHE_DIR" ]]; then - _log "Storing cache on NFS host: $NFS_CACHE_DIR" - mkdir -p "$NFS_CACHE_DIR" - touch "$LOCK_FILE" - _flock_with_timeout "$CACHE_LOCK_TIMEOUT" -x "$LOCK_FILE" -c " - (cd '$local_source' && tar cf - .) | (cd '$NFS_CACHE_DIR' && tar xf -) - " || { _error "Failed to store cache"; return 1; } - else - _log "Source is already at NFS path, no copy needed" - mkdir -p "$(dirname "$METADATA_FILE")" + # Build exclusions + local tar_excludes="" + if [[ "$cache_type" == "hive" ]]; then + if [[ -d "${local_source}/datadir/blockchain" ]]; then + tar_excludes="--exclude=./datadir/blockchain" + _log "Excluding datadir/blockchain" + fi + elif [[ "$cache_type" == "haf" || "$cache_type" == "haf_sync" ]]; then + tar_excludes=$(_build_haf_tar_excludes "$local_source") fi - _write_metadata "$cache_type" "$cache_key" "$NFS_CACHE_DIR" - _update_lru "$cache_type" "$cache_key" - _log "Cache stored successfully on NFS host" - _maybe_cleanup & - return 0 - fi + # Create tar archive (local I/O on NFS host, still fast) + _log "Storing cache on NFS host: $NFS_TAR_FILE" + mkdir -p "$(dirname "$NFS_TAR_FILE")" + touch "$NFS_TAR_LOCK" - # NFS client path: prefer NFS, use local cache only as fallback - # Rationale: Local cache is only useful on THIS builder. NFS is shared across all builders. - # We skip local copy to save time - if NFS push succeeds, create symlink for local reference. - - # Check if source is already on NFS - no need to copy/tar - if [[ "$local_source" == "$CACHE_NFS_PATH"/* ]]; then - _log "Source is already on NFS: $local_source" - # Create symlink from expected cache path to actual location if different - if [[ "$local_source" != "$NFS_CACHE_DIR" && ! -e "$NFS_CACHE_DIR" ]]; then - ln -sf "$local_source" "$NFS_CACHE_DIR" 2>/dev/null || true + # shellcheck disable=SC2086 + if ! _flock_with_timeout "$CACHE_LOCK_TIMEOUT" -x "$NFS_TAR_LOCK" -c " + tar cf '$NFS_TAR_FILE.tmp' $tar_excludes -C '$local_source' . + mv '$NFS_TAR_FILE.tmp' '$NFS_TAR_FILE' + "; then + _error "Failed to store cache" + return 1 fi + + # Write metadata + mkdir -p "$NFS_CACHE_DIR" _write_metadata "$cache_type" "$cache_key" "$local_source" _update_lru "$cache_type" "$cache_key" - _log "Cache registered (source already on NFS)" + _log "Cache stored successfully on NFS host" + _maybe_cleanup & return 0 fi - if ! 
_nfs_available; then - # NFS unavailable - use local cache as fallback - if [[ "$LOCAL_CACHE_DIR" != "$local_source" ]]; then - _log "NFS not available, caching locally: $LOCAL_CACHE_DIR" - mkdir -p "$(dirname "$LOCAL_CACHE_DIR")" - cp -a "$local_source" "$LOCAL_CACHE_DIR" 2>/dev/null || true - fi - _log "Cached locally only (NFS unavailable)" - return 0 - fi + # NFS client path: create local tar first, then push to NFS - # Check if already exists on NFS (either as directory or tar archive) - local NFS_TAR_FILE="${NFS_CACHE_DIR}.tar" - if [[ -f "$NFS_TAR_FILE" ]] || { [[ -d "$NFS_CACHE_DIR" ]] && [[ -f "$METADATA_FILE" ]]; }; then + # Check if already exists on NFS + if _nfs_available && [[ -f "$NFS_TAR_FILE" ]]; then _log "Cache already exists on NFS, updating timestamp" + # Ensure we have local copy too + if [[ ! -f "$LOCAL_TAR_FILE" ]]; then + mkdir -p "$(dirname "$LOCAL_TAR_FILE")" + cp "$NFS_TAR_FILE" "$LOCAL_TAR_FILE" 2>/dev/null || true + fi _update_lru "$cache_type" "$cache_key" return 0 fi - # Copy to NFS as single tar archive for 3x faster writes - # Benchmark: cp -a 19GB/1844 files = 74s, tar archive = 25s - # Writing single large file to NFS is much faster than many small files - mkdir -p "$(dirname "$NFS_TAR_FILE")" - local NFS_TAR_LOCK="${NFS_TAR_FILE}.lock" - touch "$NFS_TAR_LOCK" - - # Build exclusions for caches to reduce size and speed up NFS writes - # - hive caches: exclude blockchain (~1.7GB) - services use /blockchain/block_log_5m (local mount) - # - HAF caches: exclude blockchain (~1.7GB) - WAL files are kept for safe recovery + # Build exclusions local tar_excludes="" if [[ "$cache_type" == "hive" ]]; then - # Exclude blockchain - CI runners mount /blockchain locally via services_volumes if [[ -d "${local_source}/datadir/blockchain" ]]; then tar_excludes="--exclude=./datadir/blockchain" - _log "Excluding datadir/blockchain (services use local /blockchain/block_log_5m)" + _log "Excluding datadir/blockchain" fi elif [[ "$cache_type" == "haf" || "$cache_type" == "haf_sync" ]]; then tar_excludes=$(_build_haf_tar_excludes "$local_source") fi - # Write exclusions to temp file for use in subshell - local excludes_file="" - if [[ -n "$tar_excludes" ]]; then - excludes_file=$(mktemp) - echo "$tar_excludes" > "$excludes_file" + # Step 1: Create local tar (always, this is our primary cache) + _log "Creating local cache: $LOCAL_TAR_FILE" + mkdir -p "$(dirname "$LOCAL_TAR_FILE")" + + local tar_start=$(date +%s.%N) + # shellcheck disable=SC2086 + if ! tar cf "$LOCAL_TAR_FILE.tmp" $tar_excludes -C "$local_source" .; then + _error "Failed to create local tar" + rm -f "$LOCAL_TAR_FILE.tmp" + return 1 + fi + mv "$LOCAL_TAR_FILE.tmp" "$LOCAL_TAR_FILE" + + local tar_end=$(date +%s.%N) + local tar_duration=$(echo "$tar_end - $tar_start" | bc) + local tar_size=$(stat -c %s "$LOCAL_TAR_FILE" 2>/dev/null || echo 0) + local tar_size_gb=$(echo "scale=2; $tar_size / 1024 / 1024 / 1024" | bc) + _log "Local tar created: ${tar_size_gb}GB in ${tar_duration}s" + + # Step 2: Push to NFS (if available) + if ! _nfs_available; then + _log "NFS not available, cached locally only" + return 0 fi + mkdir -p "$(dirname "$NFS_TAR_FILE")" + touch "$NFS_TAR_LOCK" + # Check for stale locks before attempting to acquire _check_stale_lock "$NFS_TAR_LOCK" local lock_start_time=$(date +%s.%N) - _log "Attempting to acquire lock: $NFS_TAR_LOCK" + _log "Pushing to NFS: $NFS_TAR_FILE" if ! 
_flock_with_timeout "$CACHE_LOCK_TIMEOUT" -x "$NFS_TAR_LOCK" -c "
-        # Record lock acquisition time
-        lock_acquired=\$(date +%s.%N)
-        echo \"[cache-manager] Lock acquired in \$(echo \"\$lock_acquired - $lock_start_time\" | bc)s\" >&2
-
         # Write lock holder info for debugging
         cat > '${NFS_TAR_LOCK}.info' 2>/dev/null <<LOCKINFO
 hostname=\$(hostname)
 pid=\$\$
 started=\$(date -Iseconds)
 job_id=${CI_JOB_ID:-}
 pipeline_id=${CI_PIPELINE_ID:-}
 LOCKINFO
 
         # Double-check: cache may have been created while waiting for the lock
         if [[ -f '$NFS_TAR_FILE' ]]; then
             echo '[cache-manager] Cache was created while waiting for lock' >&2
             exit 0
         fi
 
-        tar_start=\$(date +%s.%N)
-        echo '[cache-manager] Creating tar archive on NFS: $NFS_TAR_FILE' >&2
-        # Read exclusions from temp file if present
-        excludes=''
-        if [[ -f '$excludes_file' ]]; then
-            excludes=\$(cat '$excludes_file')
-        fi
-        # Write tar archive directly to NFS (single file = fast)
-        # shellcheck disable=SC2086
-        tar cf '$NFS_TAR_FILE.tmp' \$excludes -C '$local_source' .
-        tar_end=\$(date +%s.%N)
-        tar_duration=\$(echo \"\$tar_end - \$tar_start\" | bc)
-
-        # Get file size for throughput calculation
-        tar_size=\$(stat -c %s '$NFS_TAR_FILE.tmp' 2>/dev/null || echo 0)
-        tar_size_gb=\$(echo \"scale=2; \$tar_size / 1024 / 1024 / 1024\" | bc)
-        throughput=\$(echo \"scale=2; \$tar_size / 1024 / 1024 / \$tar_duration\" | bc 2>/dev/null || echo '?')
-
-        echo \"[cache-manager] Tar completed: \${tar_size_gb}GB in \${tar_duration}s (\${throughput} MB/s)\" >&2
-
+        # Copy local tar to NFS
+        copy_start=\$(date +%s.%N)
+        cp '$LOCAL_TAR_FILE' '$NFS_TAR_FILE.tmp'
         mv '$NFS_TAR_FILE.tmp' '$NFS_TAR_FILE'
+        copy_end=\$(date +%s.%N)
+
+        copy_duration=\$(echo \"\$copy_end - \$copy_start\" | bc)
+        throughput=\$(echo \"scale=2; $tar_size / 1024 / 1024 / \$copy_duration\" | bc 2>/dev/null || echo '?')
+        echo \"[cache-manager] NFS push completed in \${copy_duration}s (\${throughput} MB/s)\" >&2
 
         # Clean up lock info file
         rm -f '${NFS_TAR_LOCK}.info' 2>/dev/null || true
-
-        total_duration=\$(echo \"\$(date +%s.%N) - $lock_start_time\" | bc)
-        echo \"[cache-manager] Total put operation: \${total_duration}s\" >&2
     "; then
-        [[ -n "$excludes_file" ]] && rm -f "$excludes_file"
-        _error "Failed to acquire exclusive lock"
-        return 1
+        _log "WARNING: Failed to push to NFS, but local cache exists"
+        # Don't fail - we have local cache
     fi
-    [[ -n "$excludes_file" ]] && rm -f "$excludes_file"
-
-    # Write metadata next to tar file
-    local TAR_METADATA="${NFS_TAR_FILE%.tar}/.metadata"
-    mkdir -p "$(dirname "$TAR_METADATA")"
-    _write_metadata "$cache_type" "$cache_key" "$local_source"
-    mv "$METADATA_FILE" "$TAR_METADATA" 2>/dev/null || true
 
-    # Create local symlink to source for future local hits (instant, no copy)
-    if [[ "$LOCAL_CACHE_DIR" != "$local_source" && ! 
-e "$LOCAL_CACHE_DIR" ]]; then - mkdir -p "$(dirname "$LOCAL_CACHE_DIR")" - ln -sf "$local_source" "$LOCAL_CACHE_DIR" 2>/dev/null || true - _log "Created local cache symlink: $LOCAL_CACHE_DIR -> $local_source" + # Write metadata next to tar file (if NFS push succeeded) + if [[ -f "$NFS_TAR_FILE" ]]; then + local TAR_METADATA="${NFS_TAR_FILE%.tar}/.metadata" + mkdir -p "$(dirname "$TAR_METADATA")" + _write_metadata "$cache_type" "$cache_key" "$local_source" + mv "$METADATA_FILE" "$TAR_METADATA" 2>/dev/null || true + _update_lru "$cache_type" "$cache_key" fi - _update_lru "$cache_type" "$cache_key" - _log "Cache stored successfully (tar archive)" + _log "Cache stored successfully" # Trigger async cleanup check _maybe_cleanup & @@ -761,10 +698,12 @@ cmd_cleanup() { continue fi - local entry_path="$CACHE_NFS_PATH/$entry" + local entry_dir="$CACHE_NFS_PATH/$entry" + local entry_tar="${entry_dir}.tar" + local entry_tar_lock="${entry_tar}.lock" - # Skip if doesn't exist - [[ -d "$entry_path" ]] || continue + # Skip if doesn't exist (tar file is the primary format) + [[ -f "$entry_tar" ]] || continue # Check if should remove (age or size) local should_remove=false @@ -779,15 +718,15 @@ cmd_cleanup() { if [[ "$should_remove" == "true" ]]; then # Check if locked (skip if in use) - local lock_file="$entry_path/.lock" - if [[ -f "$lock_file" ]] && ! flock -n "$lock_file" -c "true" 2>/dev/null; then + if [[ -f "$entry_tar_lock" ]] && ! flock -n "$entry_tar_lock" -c "true" 2>/dev/null; then _log "Skipping $entry - currently locked" continue fi - local entry_size=$(du -sb "$entry_path" 2>/dev/null | cut -f1 || echo 0) + local entry_size=$(stat -c %s "$entry_tar" 2>/dev/null || echo 0) _log "Removing: $entry (${entry_size} bytes)" - rm -rf "$entry_path" + rm -f "$entry_tar" "$entry_tar_lock" "${entry_tar_lock}.info" + rm -rf "$entry_dir" # Remove metadata directory if exists total_size=$((total_size - entry_size)) removed=$((removed + 1)) @@ -827,12 +766,13 @@ cmd_list() { local cache_type="${1:-}" echo "=== Local Caches (${CACHE_LOCAL_PATH}) ===" - local pattern="${CACHE_LOCAL_PATH}/${cache_type}*" - for dir in $pattern; do - [[ -d "$dir" ]] || continue - local size=$(du -sh "$dir" 2>/dev/null | cut -f1 || echo "?") - local mtime=$(stat -c %y "$dir" 2>/dev/null | cut -d. -f1 || echo "?") - echo " $(basename "$dir") - ${size} - ${mtime}" + local pattern="${CACHE_LOCAL_PATH}/${cache_type}*.tar" + for tarfile in $pattern; do + [[ -f "$tarfile" ]] || continue + local size=$(du -sh "$tarfile" 2>/dev/null | cut -f1 || echo "?") + local mtime=$(stat -c %y "$tarfile" 2>/dev/null | cut -d. -f1 || echo "?") + local key=$(basename "$tarfile" .tar) + echo " $key - ${size} - ${mtime}" done if _nfs_available; then @@ -842,13 +782,18 @@ cmd_list() { [[ -n "$cache_type" ]] && nfs_path="$CACHE_NFS_PATH/$cache_type" if [[ -d "$nfs_path" ]]; then - for dir in "$nfs_path"/*/; do - [[ -d "$dir" ]] || continue - local size=$(du -sh "$dir" 2>/dev/null | cut -f1 || echo "?") - local key=$(basename "$dir") + # List tar archives (current format) + for tarfile in "$nfs_path"/*.tar; do + [[ -f "$tarfile" ]] || continue + local size=$(du -sh "$tarfile" 2>/dev/null | cut -f1 || echo "?") + local key=$(basename "$tarfile" .tar) + local mtime=$(stat -c %y "$tarfile" 2>/dev/null | cut -d. 
-f1 || echo "?")
+            local meta_dir="${tarfile%.tar}"
             local meta=""
-            if [[ -f "$dir/.metadata" ]]; then
-                meta=$(jq -r '.created_at // "?"' "$dir/.metadata" 2>/dev/null || echo "?")
+            if [[ -f "$meta_dir/.metadata" ]]; then
+                meta=$(jq -r '.created_at // "?"' "$meta_dir/.metadata" 2>/dev/null || echo "?")
+            else
+                meta="$mtime"
             fi
             echo "  $key - ${size} - ${meta}"
         done
-- 
GitLab


From dbb7b637d2b48c5b1edb8f5620b2ce0566136dff Mon Sep 17 00:00:00 2001
From: Dan Notestein
Date: Sun, 28 Dec 2025 18:19:43 -0500
Subject: [PATCH 2/4] Add common-ci-images.md documenting Docker images and usage status

---
 docs/common-ci-images.md | 272 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 272 insertions(+)
 create mode 100644 docs/common-ci-images.md

diff --git a/docs/common-ci-images.md b/docs/common-ci-images.md
new file mode 100644
index 0000000..2f30954
--- /dev/null
+++ b/docs/common-ci-images.md
@@ -0,0 +1,272 @@
# Common CI Images

Docker images built by common-ci-configuration for use across Hive blockchain CI/CD pipelines.

## Image Registry

All images are published to:
```
registry.gitlab.syncad.com/hive/common-ci-configuration/<image-name>:<tag>
```

## Build Images

### docker-builder

**Base:** Alpine (docker:26.1.4-cli)

CI image for building Docker images using BuildKit. Runs as a non-root user with sudo access.

**Includes:** bash, git, coreutils, curl, sudo, util-linux (for NFS flock support)

**Used by:**
- `prepare_hived_image` jobs in hive/haf
- `prepare_haf_data` replay jobs
- Any job that builds Docker images via `docker buildx`

**Example:**
```yaml
build_image:
  image: registry.gitlab.syncad.com/hive/common-ci-configuration/docker-builder:latest
  services:
    - name: registry.gitlab.syncad.com/hive/common-ci-configuration/docker-dind:latest
      alias: docker
```

### docker-dind

**Base:** Alpine (docker:26.1.4-dind)

Docker-in-Docker service image. Used as a sidecar service for jobs that need to build/run Docker containers.

**Includes:** util-linux (for NFS flock support in cache operations)

**Note:** Exposes only port 2376 to work around GitLab Runner healthcheck issues.

### ci-base-image

**Base:** Ubuntu 24.04 (phusion/baseimage)

Full build environment for hive/HAF C++ compilation and Python testing.

**Python:** 3.14

**Includes:**
- C++ build toolchain (cmake, ninja, ccache)
- Python 3.14 with poetry
- Docker CLI and buildx
- PostgreSQL client libraries (libpq-dev)
- Compression libraries (zstd, snappy)

**Current version:** `ubuntu24.04-py3.14-2`

**Used by:** hive and HAF build/test pipelines that need the full toolchain.

### emsdk

**Base:** Debian (emscripten/emsdk)

WebAssembly build environment with Emscripten toolchain and pre-compiled dependencies.

**Includes:**
- Emscripten SDK (version configured in docker-bake.hcl)
- Node.js 22.x with pnpm
- Pre-compiled WASM libraries: Boost, OpenSSL, secp256k1
- Build tools: ninja, autoconf, libtool, protobuf

**Current version:** `4.0.18-1`

**Used by:** wax and other WASM projects for building JavaScript/TypeScript packages.

## Runtime Images

### python

**Base:** Debian (python:3.12.9-slim-bookworm)

**Python:** 3.12.9

Lightweight Python runtime with poetry for running Python applications.

**Used by:** Python-based services and test runners.

### python_runtime

**Base:** Ubuntu 24.04

**Python:** 3.12

Minimal Python 3.12 runtime environment.
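To sanity-check what a runtime image actually ships, a one-off run works (a sketch; the tag is assumed to follow the registry pattern above combined with the version string listed below):

```bash
docker run --rm \
  registry.gitlab.syncad.com/hive/common-ci-configuration/python_runtime:3.12-u24.04-1 \
  python3 --version
```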
+ +**Current version:** `3.12-u24.04-1` + +### python_development + +**Base:** Ubuntu 24.04 (same Dockerfile as python_runtime, different target) + +**Python:** 3.12 + +Python development environment with additional tools for testing and development. + +### python-scripts + +**Base:** Debian (python:3.12.2) + +**Python:** 3.12.2 + +Contains Python utilities for CI operations: +- `delete-image.py` - GitLab registry cleanup +- `remove-buildkit-cache.py` - BuildKit cache management + +**Used by:** Registry cleanup jobs. + +## Service Images + +### psql + +**Base:** Alpine (ghcr.io/alphagov/paas/psql) + +PostgreSQL client for database operations in CI jobs. + +**Current version:** `14-1` + +**Used by:** Jobs that need to run SQL queries or manage PostgreSQL databases. + +### postgrest + +**Base:** Alpine (postgrest/postgrest) + +PostgREST API server for exposing PostgreSQL as REST API. + +**Current version:** `v12.0.2` + +**Used by:** API testing and HAF API node deployments. + +### nginx + +**Base:** Alpine (openresty/openresty:alpine) + +OpenResty (nginx + Lua) for reverse proxy and API gateway. + +**Used by:** Frontend deployments and API proxying. + +## Utility Images + +### alpine + +**Base:** Alpine 3.21.3 + +Minimal Alpine image mirrored to GitLab registry. + +**Used by:** Simple utility jobs, base for other images. + +### dockerfile + +**Base:** docker/dockerfile + +BuildKit frontend for advanced Dockerfile features. + +**Current version:** `1.11` + +### benchmark-test-runner + +**Base:** Alpine 3.17 + +**Python:** 3.x (Alpine system Python) + +JMeter-based benchmark test runner. + +**Used by:** Performance testing jobs. + +### tox-test-runner + +**Base:** Alpine (python:3.11-alpine) + +**Python:** 3.11 + +Python tox test runner for multi-version Python testing. + +**Used by:** Python package testing across multiple Python versions. + +## Image Usage Status + +### Images Used in CI Templates + +| Image | Used in Templates | Purpose | +|-------|-------------------|---------| +| `python-scripts` | Yes | Registry cleanup utilities | +| `docker-builder` | Yes | Building Docker images | +| `docker-dind` | Yes | Docker-in-Docker service | +| `emsdk` | Yes | WASM builds | +| `tox-test-runner` | Yes | Python multi-version testing | +| `benchmark-test-runner` | Yes | JMeter performance tests | +| `alpine` | Yes | Base image for various jobs | +| `nginx` | Yes | Reverse proxy / API gateway | +| `postgrest` | Yes | REST API for PostgreSQL | +| `psql` | Yes | PostgreSQL client | +| `ci-base-image` | No | Used directly by hive/haf pipelines | +| `dockerfile` | No | BuildKit frontend | + +### Potentially Redundant Images + +These images are built but do not appear to be used in templates or downstream projects: + +| Image | Python | Notes | +|-------|--------|-------| +| `python` | 3.12.9 | Slim Debian runtime - may be replaced by ci-base-image | +| `python_runtime` | 3.12 | Minimal Ubuntu runtime - no known usage | +| `python_development` | 3.12 | Ubuntu dev environment - may be replaced by ci-base-image | + +Before removing these images, verify: +1. Check GitLab registry for recent pulls +2. 
Search for usage in external projects not in the main repos

## Python Version Summary

| Image | Python Version | Notes |
|-------|----------------|-------|
| ci-base-image | 3.14 | Latest Python for hive/HAF testing |
| python | 3.12.9 | Slim Debian-based runtime |
| python_runtime | 3.12 | Minimal Ubuntu runtime |
| python_development | 3.12 | Ubuntu with dev tools |
| python-scripts | 3.12.2 | CI utilities |
| tox-test-runner | 3.11 | Multi-version testing |
| benchmark-test-runner | 3.x | Alpine system Python |

## Version Management

Image versions are defined in `docker-bake.hcl`:

| Variable | Current Value | Description |
|----------|---------------|-------------|
| `EMSCRIPTEN_VERSION` | 4.0.18 | Emscripten SDK version |
| `PYTHON_VERSION` | 3.12.9-slim-bookworm | Python base image version |
| `PYTHON_RUNTIME_VERSION` | 3.12-u24.04-1 | Python runtime version |
| `CI_BASE_IMAGE_VERSION` | ubuntu24.04-py3.14-2 | CI base image version |
| `PSQL_IMAGE_VERSION` | 14-1 | PostgreSQL client version |
| `POSTGREST_VERSION` | v12.0.2 | PostgREST version |
| `ALPINE_VERSION` | 3.21.3 | Alpine base version |
| `DOCKERFILE_IMAGE_VERSION` | 1.11 | Dockerfile frontend version |

## Building Images Locally

```bash
# Build a specific target
docker buildx bake <target>

# Build with custom tag
docker buildx bake <target> --set *.tags=myregistry/myimage:mytag

# Available targets:
# docker-builder, docker-dind, ci-base-image, emsdk, python, python_runtime,
# python_development, python-scripts, psql, postgrest, nginx, alpine,
# dockerfile, benchmark-test-runner, tox-test-runner
```

## NFS Compatibility

The following Alpine-based images include `util-linux` for proper NFS flock support:
- `docker-builder`
- `docker-dind`

This is required for cache-manager.sh operations on NFS-mounted cache directories. BusyBox flock (Alpine default) returns "Bad file descriptor" on NFS mounts.
-- 
GitLab


From 4607c850ea34152ee8c9ea6cfad23ac629507e93 Mon Sep 17 00:00:00 2001
From: Dan Notestein
Date: Sun, 28 Dec 2025 18:21:51 -0500
Subject: [PATCH 3/4] Remove BusyBox flock fallback, require util-linux

BusyBox flock returns 'Bad file descriptor' on NFS mounts and cannot be
used reliably. Instead of silently falling back to a retry loop, fail
fast with a clear error message explaining how to fix it.

- Add _check_flock_support() that detects BusyBox and exits with error
- Check runs at startup for get/put/cleanup commands
- Simplify _flock_with_timeout() to just use -w flag directly
- Update documentation to reflect util-linux requirement
---
 docs/cache-manager.md    | 20 ++++++++------------
 scripts/cache-manager.sh | 37 +++++++++++++++----------------------
 2 files changed, 23 insertions(+), 34 deletions(-)

diff --git a/docs/cache-manager.md b/docs/cache-manager.md
index 108efc1..cf61908 100644
--- a/docs/cache-manager.md
+++ b/docs/cache-manager.md
@@ -142,22 +142,18 @@ pipeline_id=98765
 
 This helps diagnose stale locks.
 
-### BusyBox Compatibility
+### flock Requirements
 
-Alpine-based images (docker-builder, docker-dind) may have BusyBox flock which lacks timeout support. The script detects this and falls back to a retry loop:
+**util-linux is required.** BusyBox flock does not work with NFS - it returns "Bad file descriptor" when locking NFS files.
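A quick way to tell which flock an image ships, mirroring the `_check_flock_support` probe added to the script below (util-linux flock supports the `-w` timeout flag, BusyBox flock does not):

```bash
flock --help 2>&1 | grep -q -- '-w' && echo "util-linux flock" || echo "BusyBox flock"
```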
-```bash
-# GNU coreutils flock
-flock -x -w 120 lockfile command
-
-# BusyBox fallback (retry every 5s for up to timeout)
-while [[ $elapsed -lt $timeout ]]; do
-    flock -x -n lockfile command && return 0
-    sleep 5
-done
+The script checks for proper flock support at startup and fails with a clear error if BusyBox flock is detected:
+
+```
+[cache-manager] ERROR: BusyBox flock detected - this does not work with NFS!
+[cache-manager] ERROR: Install util-linux package: apk add util-linux (Alpine)
 ```
 
-**Note:** util-linux must be installed in Alpine images for proper NFS flock support. BusyBox flock returns "Bad file descriptor" on NFS mounts.
+The `docker-builder` and `docker-dind` images already include util-linux.
 
 ## Handling Failures and Canceled Jobs
 
diff --git a/scripts/cache-manager.sh b/scripts/cache-manager.sh
index a650c91..2614578 100755
--- a/scripts/cache-manager.sh
+++ b/scripts/cache-manager.sh
@@ -29,13 +29,19 @@
 
 set -euo pipefail
 
-# Detect flock capabilities (BusyBox vs GNU coreutils)
-# BusyBox flock doesn't support -w (timeout), only -n (nonblock)
-_flock_supports_timeout() {
-    flock --help 2>&1 | grep -q -- '-w' 2>/dev/null
+# Check for proper flock support (util-linux, not BusyBox)
+# BusyBox flock returns "Bad file descriptor" on NFS mounts and lacks -w timeout support
+_check_flock_support() {
+    # Check if flock supports -w (timeout) - util-linux does, BusyBox doesn't
+    if ! flock --help 2>&1 | grep -q -- '-w'; then
+        _error "BusyBox flock detected - this does not work with NFS!"
+        _error "Install util-linux package: apk add util-linux (Alpine) or apt install util-linux (Debian)"
+        _error "Docker images docker-builder and docker-dind should already have util-linux installed."
+        exit 1
+    fi
 }
 
-# Wrapper for flock that handles BusyBox compatibility
+# Wrapper for flock with timeout
 # Usage: _flock_with_timeout <timeout> <mode> <lockfile> <command...>
 # mode: -s (shared) or -x (exclusive)
 _flock_with_timeout() {
@@ -44,23 +50,7 @@ _flock_with_timeout() {
     local lockfile="$3"
     shift 3
 
-    if _flock_supports_timeout; then
-        # GNU coreutils flock - use -w for timeout
-        flock "$mode" -w "$timeout" "$lockfile" "$@"
-    else
-        # BusyBox flock - no timeout support, use -n (non-blocking) with retry loop
-        local elapsed=0
-        local interval=5
-        while [[ $elapsed -lt $timeout ]]; do
-            if flock "$mode" -n "$lockfile" "$@" 2>/dev/null; then
-                return 0
-            fi
-            sleep "$interval"
-            elapsed=$((elapsed + interval))
-        done
-        _error "Timeout waiting for lock after ${timeout}s"
-        return 1
-    fi
+    flock "$mode" -w "$timeout" "$lockfile" "$@"
 }
 
 # Configuration with defaults
@@ -872,13 +862,16 @@ shift
 case "$cmd" in
     get)
         [[ $# -lt 3 ]] && { _error "get requires: <cache-type> <cache-key> <local-dest>"; exit 1; }
+        _check_flock_support
         cmd_get "$@"
         ;;
     put)
         [[ $# -lt 3 ]] && { _error "put requires: <cache-type> <cache-key> <local-source>"; exit 1; }
+        _check_flock_support
         cmd_put "$@"
         ;;
     cleanup)
+        _check_flock_support
         cmd_cleanup "$@"
         ;;
     list)
-- 
GitLab


From ac636630dc059f611a4f2435883ee43075ffa338 Mon Sep 17 00:00:00 2001
From: Dan Notestein
Date: Sun, 28 Dec 2025 19:37:54 -0500
Subject: [PATCH 4/4] Add block log storage documentation to cache-manager.md

---
 docs/cache-manager.md | 64 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/docs/cache-manager.md b/docs/cache-manager.md
index cf61908..ba78315 100644
--- a/docs/cache-manager.md
+++ b/docs/cache-manager.md
@@ -352,3 +352,67 @@ find /cache -name "*.tar" -mtime +7 -delete
 ```
 
 Local caches are automatically populated when fetching from NFS and persist across jobs on the same builder.
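If local tars pile up, the manual prune above can be automated (a sketch, assuming a root cron job on each builder; the path and 7-day threshold are illustrative):

```bash
#!/bin/sh
# Hypothetical /etc/cron.daily/prune-ci-local-cache
# Drop local cache tars older than 7 days; cache-manager re-fetches them
# from NFS on the next GET, so this only trades fetch time for disk space.
find /cache -name '*.tar' -mtime +7 -delete
```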
+ +## Block Log Storage + +Block logs are static, read-only data stored separately from the cache-manager system. They are used as input for replay operations. + +### Storage Location + +Each builder has block logs at `/storage1/blockchain/` (symlinked to `/blockchain/`): + +``` +/blockchain/ +├── block_log_5m/ # 3.4G - 5 million blocks (mainnet) +│ ├── block_log +│ ├── block_log.artifacts +│ └── block_log_part.* +└── block_log_5m_mirrornet/ # 1.7G - 5 million blocks (mirrornet) + ├── block_log + └── block_log.artifacts +``` + +### Usage in CI + +CI jobs reference these via variables defined in `.gitlab-ci.yml`: + +```yaml +variables: + BLOCK_LOG_SOURCE_DIR_5M: /blockchain/block_log_5m + BLOCK_LOG_SOURCE_DIR_MIRRORNET_5M: /blockchain/block_log_5m_mirrornet +``` + +Jobs mount the block_log directory into containers for replay: + +```yaml +script: + - | + docker run \ + -v $BLOCK_LOG_SOURCE_DIR_5M:/blockchain:ro \ + hived --replay +``` + +### Why Block Logs Are Not Cached + +Block logs are excluded from cache-manager for several reasons: + +1. **Static data**: Block logs don't change - they're fixed test datasets +2. **Already local**: Every builder has a local copy, no NFS fetch needed +3. **Read-only mounts**: Jobs mount them read-only, preventing corruption +4. **Size efficiency**: The 5M block logs (3.4G) are small enough to store everywhere + +The `put` command automatically excludes `datadir/blockchain` from HAF/hive caches since jobs use the local block_log mount instead of copying block data into each cache. + +### Maintenance + +Block logs are manually updated when new test data is needed. No automatic cleanup is required since they're static. + +```bash +# Check block_log on a builder +ls -la /blockchain/block_log_5m/ + +# Verify all builders have consistent data +for i in 5 6 7 8 9 10 11; do + ssh hive-builder-$i 'ls -la /blockchain/block_log_5m/' +done +``` -- GitLab