
Fix node connection loss in beem tests

Mateusz Żebrak requested to merge mzebrak/beem-tests-random-fail-debug into develop

This MR requires the test_tools submodule changes to be merged first, followed by an update of the submodule reference here: test-tools!81 (merged)

When CI/CD runners are overloaded, there are situations when BEEM tests randomly fail.

In the log output, we can see that this is due to the loss of connection to the node because we are exceeding the number of available attempts:

  • Lost connection or internal error on node: http://127.0.0.1:36753 (99/100)
  • Lost connection or internal error on node: http://127.0.0.1:36753 (100/100)

This behavior could be observed in the job below: https://gitlab.syncad.com/hive/hive/-/jobs/359592

This happens because the runner's processor is overloaded and the node cannot respond to the query in time. Fortunately, BEEM has a retry mechanism, but it also has a counter that makes it give up after a fixed number of attempts (num_retries=100 by default).

Other log messages appear between consecutive failed attempts, which shows that the connection to the node is lost only intermittently rather than permanently.

When the number of available attempts is set to -1 (num_retries=-1), BEEM keeps trying to reconnect to the node indefinitely. This works in our favor: the tests can pass on overloaded runners, and if the connection cannot be recovered within a certain time, we still have a timeout mechanism to fall back on.
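
The retry behavior described above can be illustrated with a minimal, standard-library-only sketch. This is not BEEM's implementation: `call_with_retries`, `make_flaky_node`, `NodeUnavailable`, `deadline_s`, and `pause_s` are hypothetical names introduced for illustration, and only the meaning of `num_retries` (a fixed budget by default, -1 for no limit) is taken from this description. The deadline stands in for the timeout mechanism mentioned above.

```python
import time


class NodeUnavailable(Exception):
    """Raised when the retry budget or the overall deadline is exhausted."""


def call_with_retries(request, num_retries=100, deadline_s=30.0, pause_s=0.0):
    """Call `request` until it succeeds (hypothetical helper, not BEEM code).

    A negative num_retries means "no limit on attempts"; in that case the
    deadline is the only thing that stops the loop.
    """
    start = time.monotonic()
    failures = 0
    while True:
        try:
            return request()
        except ConnectionError:
            failures += 1
            # Bounded mode: give up once the failure budget is used up.
            if num_retries >= 0 and failures > num_retries:
                raise NodeUnavailable(f"gave up after {failures} failed attempts")
            # Safety net for unbounded mode: an overall time limit.
            if time.monotonic() - start > deadline_s:
                raise NodeUnavailable("deadline exceeded")
            time.sleep(pause_s)


def make_flaky_node(failures_before_success):
    """Simulate an overloaded node that fails N times and then answers."""
    state = {"left": failures_before_success}

    def request():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("node did not respond")
        return "ok"

    return request


if __name__ == "__main__":
    # A node that drops 150 requests exceeds a budget of 100 retries...
    try:
        call_with_retries(make_flaky_node(150), num_retries=100)
    except NodeUnavailable as exc:
        print("bounded:", exc)
    # ...but recovers when the number of retries is unlimited.
    print("unbounded:", call_with_retries(make_flaky_node(150), num_retries=-1))
```

In this sketch the unlimited mode trades a hard attempt count for a time-based limit, which mirrors the reasoning above: a slow runner gets as many attempts as it needs, while a node that never recovers is still caught by the deadline.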

Edited by Mateusz Żebrak
