Deadlock on 'Cannot determine space for asset' assert exception while doing replay
I encountered a deadlock while doing replay when sql error occurs while copying data.
I can reproduce it by first installing debug HAF extension and then by doing replay on the main build.
I suspect this is because we try to kill the node on error, but any thread currently waiting on condition variable is not interrupted.
Hived log just before the deadlock:
2024-06-06T09:30:34.021076 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.accounts start
2024-06-06T09:30:34.022138 container_data_writer.h:129 flush_replayed_data ] flush_replayed_data hive.accounts end
2024-06-06T09:30:34.022334 block_num_rendezvous_trigger.cpp:26 report_complete_thre ] _stage_block_num=1000
2024-06-06T09:30:34.022381 block_num_rendezvous_trigger.cpp:27 report_complete_thre ] m_number_of_threads=13
2024-06-06T09:30:34.022414 data_processor.cpp:113 operator() ] Accounts data writer data processor finished processing a data chunk...
2024-06-06T09:30:34.022471 data_processor.cpp:75 operator() ] Accounts data writer data processor is waiting for DATA-READY signal...
2024-06-06T09:30:34.022689 transaction_controllers.cpp:128 do_reconnect ] Connected to database: `dbname=haf_block_log user=hived password=hived hostaddr=127.0.0.1 port=5432'.
2024-06-06T09:30:34.022721 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.022854 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.operations start
2024-06-06T09:30:34.023474 transaction_controllers.cpp:128 do_reconnect ] Connected to database: `dbname=haf_block_log user=hived password=hived hostaddr=127.0.0.1 port=5432'.
2024-06-06T09:30:34.023511 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.023702 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.account_operations start
2024-06-06T09:30:34.023992 transaction_controllers.cpp:128 do_reconnect ] Connected to database: `dbname=haf_block_log user=hived password=hived hostaddr=127.0.0.1 port=5432'.
2024-06-06T09:30:34.024025 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.024179 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.operations start
2024-06-06T09:30:34.020340 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.024407 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.account_operations start
2024-06-06T09:30:34.017327 transaction_controllers.cpp:128 do_reconnect ] Connected to database: `dbname=haf_block_log user=hived password=hived hostaddr=127.0.0.1 port=5432'.
2024-06-06T09:30:34.024536 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.024941 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.blocks start
2024-06-06T09:30:34.025636 transaction_controllers.cpp:128 do_reconnect ] Connected to database: `dbname=haf_block_log user=hived password=hived hostaddr=127.0.0.1 port=5432'.
2024-06-06T09:30:34.025672 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.026505 container_data_writer.h:129 flush_replayed_data ] flush_replayed_data hive.account_operations end
2024-06-06T09:30:34.026520 container_data_writer.h:129 flush_replayed_data ] flush_replayed_data hive.account_operations end
2024-06-06T09:30:34.026626 block_num_rendezvous_trigger.cpp:26 report_complete_thre ] _stage_block_num=1000
2024-06-06T09:30:34.026648 block_num_rendezvous_trigger.cpp:27 report_complete_thre ] m_number_of_threads=13
2024-06-06T09:30:34.026669 data_processor.cpp:113 operator() ] Account operations data writer_0 data processor finished processing a data chunk...
2024-06-06T09:30:34.026689 data_processor.cpp:75 operator() ] Account operations data writer_0 data processor is waiting for DATA-READY signal...
2024-06-06T09:30:34.026813 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.operations start
2024-06-06T09:30:34.027038 block_num_rendezvous_trigger.cpp:26 report_complete_thre ] _stage_block_num=1000
2024-06-06T09:30:34.027071 block_num_rendezvous_trigger.cpp:27 report_complete_thre ] m_number_of_threads=13
2024-06-06T09:30:34.027092 data_processor.cpp:113 operator() ] Account operations data writer_1 data processor finished processing a data chunk...
2024-06-06T09:30:34.027125 data_processor.cpp:75 operator() ] Account operations data writer_1 data processor is waiting for DATA-READY signal...
2024-06-06T09:30:34.028323 data_processor.cpp:118 operator() ] Operation data writer_1 data processor errored...
2024-06-06T09:30:34.028372 data_processor.cpp:228 handle_exception ] Data processor Operation data writer_1 detected SQL statement execution failure. Failing statement: `[END COPY]'.
2024-06-06T09:30:34.028399 data_processor.cpp:16 kill_node ] An error occured and HAF is stopping synchronization...
2024-06-06T09:30:34.029287 data_processor.cpp:118 operator() ] Operation data writer_4 data processor errored...
2024-06-06T09:30:34.029326 data_processor.cpp:228 handle_exception ] Data processor Operation data writer_4 detected SQL statement execution failure. Failing statement: `[END COPY]'.
2024-06-06T09:30:34.029343 data_processor.cpp:16 kill_node ] An error occured and HAF is stopping synchronization...
2024-06-06T09:30:34.032280 data_processor.cpp:118 operator() ] Operation data writer_0 data processor errored...
2024-06-06T09:30:34.032563 data_processor.cpp:228 handle_exception ] Data processor Operation data writer_0 detected SQL statement execution failure. Failing statement: `[END COPY]'.
2024-06-06T09:30:34.032726 data_processor.cpp:16 kill_node ] An error occured and HAF is stopping synchronization...
2024-06-06T09:30:34.038666 transaction_controllers.cpp:128 do_reconnect ] Connected to database: `dbname=haf_block_log user=hived password=hived hostaddr=127.0.0.1 port=5432'.
2024-06-06T09:30:34.038695 transaction_controllers.cpp:133 do_reconnect ] SET synchronous_commit = OFF;
2024-06-06T09:30:34.039348 container_data_writer.h:104 flush_replayed_data ] flush_replayed_data hive.operations start
2024-06-06T09:30:34.071491 data_processor.cpp:118 operator() ] Operation data writer_2 data processor errored...
2024-06-06T09:30:34.071544 data_processor.cpp:228 handle_exception ] Data processor Operation data writer_2 detected SQL statement execution failure. Failing statement: `[END COPY]'.
2024-06-06T09:30:34.071569 data_processor.cpp:16 kill_node ] An error occured and HAF is stopping synchronization...
2024-06-06T09:30:34.121174 container_data_writer.h:129 flush_replayed_data ] flush_replayed_data hive.blocks end
2024-06-06T09:30:34.121344 block_num_rendezvous_trigger.cpp:26 report_complete_thre ] _stage_block_num=1000
2024-06-06T09:30:34.121372 block_num_rendezvous_trigger.cpp:27 report_complete_thre ] m_number_of_threads=13
2024-06-06T09:30:34.121393 data_processor.cpp:113 operator() ] Block data writer data processor finished processing a data chunk...
2024-06-06T09:30:34.121424 data_processor.cpp:75 operator() ] Block data writer data processor is waiting for DATA-READY signal...
2024-06-06T09:30:34.121444 data_processor.cpp:80 operator() ] Block data writer data processor resumed by DATA-READY signal...
2024-06-06T09:30:34.121469 data_processor.cpp:87 operator() ] Block data writer data processor interrupted
2024-06-06T09:30:34.121487 data_processor.cpp:124 operator() ] Leaving data processor thread: Block data writer
In gdb
I can see sql threads are still running and waiting:
52 Thread 0x7fa5f77e6640 (LWP 16244) "sql[endmassive]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed96ba5f0) at ./nptl/futex-internal.c:57
53 Thread 0x7fa5f67e4640 (LWP 16246) "sql[trx]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed9690990) at ./nptl/futex-internal.c:57
54 Thread 0x7fa5f5fe3640 (LWP 16247) "sql[trx]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed9696660) at ./nptl/futex-internal.c:57
55 Thread 0x7fa5f57e2640 (LWP 16248) "sql[trx_multi]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed9694fc0) at ./nptl/futex-internal.c:57
--Type <RET> for more, q to quit, c to continue without paging--c
56 Thread 0x7fa5e5ffb640 (LWP 16254) "sql[account]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed95c9ca0) at ./nptl/futex-internal.c:57
57 Thread 0x7fa5e57fa640 (LWP 16255) "sql[account_op]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed9695114) at ./nptl/futex-internal.c:57
58 Thread 0x7fa5e4ff9640 (LWP 16256) "sql[account_op]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed962abc0) at ./nptl/futex-internal.c:57
59 Thread 0x7fa5cffff640 (LWP 16257) "sql[hardfork]" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x558ed9620e10) at ./nptl/futex-internal.c:57
In /var/log/postgresql/postgresql-16-main.log
I can see errors related to body_binary column:
2024-06-06 09:30:34.070 UTC [16264] hived@haf_block_log ERROR: Assert Exception:(nai_data_digits >= SMT_MIN_NAI) & (nai_data_digits <= SMT_MAX_NAI) & (nai_1bit == SMT_ASSET_NUM_CONTROL_MASK) & (nai_decimal_places <= HIVE_ASSET_MAX_DECIMALS): Cannot determine space for asset 1163154179Error unpacking field symbolerror unpacking hive::protocol::assetError unpacking field vesting_shareserror unpacking hive::protocol::producer_reward_operationerror unpacking hive::protocol::operation
2024-06-06 09:30:34.070 UTC [16264] hived@haf_block_log CONTEXT: COPY operations, line 1, column body_binary: "\x4009696e69746d696e6572e80300000000000003535445454d0000"
2024-06-06 09:30:34.070 UTC [16264] hived@haf_block_log STATEMENT: COPY hive.operations(id, trx_in_block, op_pos, timestamp, body_binary) FROM STDIN
2024-06-06 09:34:28.828 UTC [4731] LOG: checkpoint starting: time
2024-06-06 09:36:12.568 UTC [4731] LOG: checkpoint complete: wrote 1033 buffers (6.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=103.677 s, sync=0.012 s, total=103.741 s; sync files=52, longest=0.005 s, average=0.001 s; distance=7975 kB, estimate=7975 kB; lsn=6F/18C72808, redo lsn=6F/18C727D0
Full log replay.log