Fix a very rare random webserver crash
The problem is a race between stopping the webserver and boost::asio
starting to accept TCP connections. The async_accept method is used to start
servicing incoming TCP connections, which means that the actual start of the
accept operation is deferred by the asio scheduler.
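The deferral described above can be pictured with a small stdlib-only sketch. It is not asio code: toy_scheduler and demo_deferred_accept are hypothetical names invented for this illustration, and the sketch only shows that a queued operation starts after the caller has already moved on.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the asio scheduler: work is queued, not run inline.
struct toy_scheduler {
  std::queue<std::function<void()>> pending;

  void post(std::function<void()> work) { pending.push(std::move(work)); }

  // Runs queued work in FIFO order until the queue is empty.
  void run() {
    while (!pending.empty()) {
      auto work = std::move(pending.front());
      pending.pop();
      work();
    }
  }
};

// Records the order of events so that the deferral is visible.
std::vector<std::string> demo_deferred_accept() {
  std::vector<std::string> events;
  toy_scheduler scheduler;

  events.push_back("async_accept requested");
  scheduler.post([&events] { events.push_back("accept started"); });
  events.push_back("caller continues");  // happens before the accept starts

  scheduler.run();
  return events;
}
```

In this model the caller gets a window between "async_accept requested" and "accept started". The real window is the one in which the webserver can be stopped.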
In some tests the hived process is started and then stopped almost immediately.
If the CI runner is heavily loaded, initialization of TCP connection
acceptance can still be in progress when hived and its webserver are being
closed. When the webserver is closed, it calls the stop_listening() method on
all its websocket services, which closes and deinitializes various internal
objects. In our case the problem is with closing the acceptor: it cleans up
boost::asio::detail::reactive_socket_service_base and its epoll reactor:
boost/asio/detail/impl/epoll_reactor.ipp
void epoll_reactor::cleanup_descriptor_data(
    per_descriptor_data& descriptor_data)
{
  if (descriptor_data)
  {
    free_descriptor_state(descriptor_data);
    descriptor_data = 0;
  }
}
However, the deferred, still-pending accept initialization needs it (descriptor_data is passed by reference here):
boost/asio/detail/impl/epoll_reactor.ipp
void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }
  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);
  if (descriptor_data->shutdown_)
  {
    post_immediate_completion(op, is_continuation);
    return;
  }
This is a classic race condition: the thread that runs the accept initialization checks that 'descriptor_data' is not null, but stop_listening(), running in another thread, sets it to null right after that check. The crash then happens on the line 'if (descriptor_data->shutdown_)'.
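The interleaving can be replayed step by step in a stdlib-only sketch. It is a single-threaded model, not the asio code: toy_descriptor_state, race_replay and replay_interleaving are hypothetical names, and the final dereference is only recorded, never executed.

```cpp
#include <memory>

// Hypothetical stand-in for asio's per-descriptor state.
struct toy_descriptor_state {
  bool shutdown_ = false;
};

struct race_replay {
  bool check_passed = false;  // outcome of the "if (!descriptor_data)" guard
  bool null_at_use = false;   // pointer state where it would be dereferenced
};

// Replays the problematic ordering of the two threads on a single thread.
race_replay replay_interleaving() {
  std::unique_ptr<toy_descriptor_state> owner(new toy_descriptor_state);
  toy_descriptor_state* descriptor_data = owner.get();
  race_replay result;

  // Step 1 (accept thread): the null check passes, so start_op continues.
  result.check_passed = (descriptor_data != nullptr);

  // Step 2 (stopping thread): cleanup frees the state and nulls the reference.
  owner.reset();
  descriptor_data = nullptr;

  // Step 3 (accept thread): the real code would now read
  // descriptor_data->shutdown_; the replay only records that it is null.
  result.null_at_use = (descriptor_data == nullptr);
  return result;
}
```

Both flags end up true: the guard was satisfied, yet the pointer is null by the time it would be used. The check and the use are not covered by one lock, so nothing prevents step 2 from running between them.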
To reproduce the issue easily, add a sleep to start_op:
boost/asio/detail/impl/epoll_reactor.ipp
void epoll_reactor::start_op(int op_type, socket_type descriptor,
    epoll_reactor::per_descriptor_data& descriptor_data, reactor_op* op,
    bool is_continuation, bool allow_speculative)
{
  if (!descriptor_data)
  {
    op->ec_ = boost::asio::error::bad_descriptor;
    post_immediate_completion(op, is_continuation);
    return;
  }
  sleep(5);
  mutex::scoped_lock descriptor_lock(descriptor_data->mutex_);
  if (descriptor_data->shutdown_)
  {
    post_immediate_completion(op, is_continuation);
    return;
  }
Then run the following HAF system test to trigger the crash: pytest -k test_dump_load_instance_scripts[30-40-40-30-40-1-0]
The solution is to drop the stop_listening() call and let the websocket servers stop and destroy their internals through the stop() method.