Remove replay_data_copy job and simplify sync

Summary

  • Remove redundant replay_data_copy job entirely (saves ~5 min per pipeline)
  • Refactor sync job to handle HAF cache fetching directly
  • Use copy_datadir.sh from HAF submodule (same pattern as balance_tracker)
  • Remove broken cache-manager copy logic that was causing failures

Changes

  1. Deleted the replay_data_copy job, which duplicated work the sync job already does
  2. Simplified the sync job's before_script (see the sketch after this list):
    • Initialize the HAF submodule (recursively, to pick up the nested hive submodule)
    • Fetch the HAF cache from NFS via cache-manager if it is not already present
    • Use copy_datadir.sh for the data copy (it handles permissions properly)
  3. Updated the dependent jobs (see the needs sketch under "Why" below):
    • cleanup_pipeline_cache: removed replay_data_copy from needs
    • e2e_benchmark_on_postgrest: removed replay_data_copy from needs and dropped the unused HIVED_UID variable
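
A minimal sketch of what the simplified sync job could look like. The submodule path, the cache-manager invocation, the copy_datadir.sh location, and the variable names are illustrative assumptions, not verbatim from the repository:

```yaml
sync:
  stage: sync
  needs:
    - prepare_haf_data                     # replay_data_copy no longer sits in between
  before_script:
    # Recursive init also pulls in HAF's nested hive submodule.
    - git submodule update --init --recursive haf
    # Hypothetical cache-manager call: fetch the HAF cache from NFS
    # only if it is not already present locally.
    - ./scripts/ci-helpers/cache-manager.sh fetch "${HAF_CACHE_NAME}"
    # copy_datadir.sh from the HAF submodule copies the replayed data
    # and fixes ownership/permissions (same pattern as balance_tracker).
    - ./haf/scripts/copy_datadir.sh "${HAF_CACHE_DIR}" "${DATADIR}"
  script:
    # Placeholder for the existing sync entrypoint.
    - ./scripts/run_sync.sh
```

Because copy_datadir.sh already handles the copy and its permissions, both the pipeline-specific copy step and the broken cache-manager copy logic become unnecessary.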

Why

The previous flow was:

  1. prepare_haf_data → creates the HAF cache
  2. replay_data_copy → copies the HAF cache to a pipeline-specific location
  3. sync → tried to copy the data again using the broken cache-manager copy logic

Now (like balance_tracker), as sketched below:

  1. prepare_haf_data → creates the HAF cache
  2. sync → fetches the cache from NFS if needed, copies it with copy_datadir.sh, and runs the sync
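
For illustration, the resulting job graph could be wired roughly like this. The stage names, job bodies, and the remaining needs entries are assumptions based on the description above, not the repository's actual configuration:

```yaml
prepare_haf_data:
  stage: prepare
  script:
    - ./scripts/prepare_haf_data.sh        # creates the shared HAF cache

sync:
  stage: sync
  needs: [prepare_haf_data]                # replay_data_copy removed from the chain
  script:
    - ./scripts/run_sync.sh

cleanup_pipeline_cache:
  stage: cleanup
  needs: [sync]                            # replay_data_copy dropped from needs
  script:
    - ./scripts/cleanup_cache.sh

e2e_benchmark_on_postgrest:
  stage: benchmark
  needs: [sync]                            # replay_data_copy dropped; HIVED_UID removed
  script:
    - ./scripts/run_benchmark.sh
```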

Testing

The merge request pipeline will verify that the refactored flow works correctly.
