fix for very rare webserver random crash

Marcin requested to merge mi_fix_for_webserver_random_crash into develop

The problem lies between stopping the webserver and boost::asio starting to accept TCP connections. The async_accept method is used to start servicing incoming TCP connections, which means the actual start of accepting is deferred by the asio scheduler. In some tests the hived process is started and then stopped almost immediately. If the CI runner is heavily loaded, initialization of accepting TCP connections can still be in progress when hived and its webserver are being closed. When the webserver is closed, it calls the stop_listening() method on all its websocket services, which closes and deinitializes various internal objects. In our case the problem is closing the acceptor: it cleans up boost::asio::detail::reactive_socket_service_base and its epoll reactor:

boost/asio/detail/impl/epoll_reactor.ipp
void epoll_reactor::cleanup_descriptor_data(
    per_descriptor_data& descriptor_data)
{
  if (descriptor_data)
  {
    free_descriptor_state(descriptor_data);
    descriptor_data = 0;
  }
}

But the delayed, still-pending accept initialization needs it (descriptor_data is passed by reference here):

boost/asio/detail/impl/epoll_reactor.ipp
void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }

  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

  if (descriptor_data->shutdown_)
  {
    post_immediate_completion(op, is_continuation);
    return;
  }

This is a classic race condition: the thread that performs the accept initialization checks that 'descriptor_data' is not null, but stop_listening(), running in another thread, sets it to null after that check, and we then crash on the line 'if (descriptor_data->shutdown_)'.
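
The ordering can be replayed in a small standalone sketch. It is not the boost code: descriptor_state, cleanup() and replay_interleaving() are hypothetical stand-ins, and the three steps run sequentially in one thread so that the result is deterministic and the freed pointer is never dereferenced. It only illustrates that a null check on a pointer held by reference goes stale once another party frees and nulls that pointer.

```cpp
#include <cassert>

// Hypothetical stand-in for the reactor's per-descriptor state.
struct descriptor_state {
  bool shutdown_ = false;
};

// Models the shape of cleanup_descriptor_data(): free the state and
// null the caller's pointer through the reference.
inline void cleanup(descriptor_state*& data) {
  if (data) {
    delete data;
    data = nullptr;
  }
}

struct interleaving_result {
  bool check_passed;  // what the accepting side saw at its null check
  bool valid_at_use;  // whether the pointer is still non-null when it would be used
};

// Replays the problematic ordering step by step in a single thread.
inline interleaving_result replay_interleaving() {
  descriptor_state* owner = new descriptor_state;
  descriptor_state*& seen_by_start_op = owner;  // same pointer, held by reference

  interleaving_result r{};
  r.check_passed = (seen_by_start_op != nullptr);  // step 1: accepting side checks
  cleanup(owner);                                  // step 2: stopping side cleans up
  r.valid_at_use = (seen_by_start_op != nullptr);  // step 3: accepting side would use it here
  return r;
}
```

Calling replay_interleaving() yields check_passed == true and valid_at_use == false: the check succeeded, yet the pointer is already null at the point of use. In the real code steps 1 and 3 run on one thread and step 2 on another, with nothing that orders them.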

To reproduce the issue easily, a sleep needs to be added:

boost/asio/detail/impl/epoll_reactor.ipp
void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }

  sleep(5);
  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);

  if (descriptor_data->shutdown_)
  {
    post_immediate_completion(op, is_continuation);
    return;
  }

and then the haf system test can be run to trigger the crash: pytest -k test_dump_load_instance_scripts[30-40-40-30-40-1-0]

The solution is to stop calling stop_listening() and let the websocket servers stop and destroy their internals through the stop() method.
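
The idea of a single shutdown path can be sketched with a stdlib-only toy model. This is an illustration under assumptions, not the hived webserver or the websocket library implementation: toy_server, async_start_accept() and descriptor_state are hypothetical names. The point it shows is that when stop() first ends the worker loop and joins the thread, and only then releases the shared state, no deferred task can observe the state while it is being torn down.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for the acceptor's reactor registration.
struct descriptor_state {
  bool shutdown_ = false;
};

// Hypothetical toy server: one worker thread drains deferred tasks,
// loosely modelling a scheduler that delays the start of an accept.
class toy_server {
public:
  toy_server() : worker_([this] { run(); }) {}
  ~toy_server() { stop(); }

  // Queue the deferred "start accepting" step; it runs later on the worker.
  void async_start_accept() {
    post([this] {
      // Only the worker touches state_ while the worker is alive.
      if (state_ && !state_->shutdown_) ++accepts_started_;
    });
  }

  // Single shutdown path: end the loop, join the worker, then tear down.
  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
    }
    wakeup_.notify_all();
    if (worker_.joinable()) worker_.join();
    assert(!worker_.joinable());
    state_.reset();  // no other thread can still be using state_
  }

  int accepts_started() const { return accepts_started_.load(); }
  bool state_released() const { return state_ == nullptr; }

private:
  void post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    wakeup_.notify_one();
  }

  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wakeup_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (stopped_) return;  // tasks still pending are dropped, not run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable wakeup_;
  std::queue<std::function<void()>> tasks_;
  bool stopped_ = false;
  std::unique_ptr<descriptor_state> state_{new descriptor_state};
  std::atomic<int> accepts_started_{0};
  std::thread worker_;  // declared last so it starts after the other members exist
};
```

With this model, starting the server and stopping it right away is safe whether or not the deferred accept step got to run: it either completes before the worker exits or is dropped, and the state is released only after the join. Calling stop() a second time is harmless.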

Edited by Marcin