Skip to content
Snippets Groups Projects

Draft: Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...

Closed Dan Notestein requested to merge speedup-ranked-posts into develop
3 unresolved threads

Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by community_id first (fix big slowdown).

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • mentioned in issue #260 (closed)

  • Dan Notestein added 1 commit

    added 1 commit

    • 8d5b2ac9 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies

    Compare with previous version

  • Dan Notestein added 4 commits

    added 4 commits

    • 8d5b2ac9...55f55884 - 2 commits from branch develop
    • 4340b63c - Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...
    • c6530c18 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies

    Compare with previous version

  • Dan Notestein added 1 commit

    added 1 commit

    • 8b72422c - Refactor get_trending_ranked_posts_for_communities to use MATERIALIZED CTE for...

    Compare with previous version

  • Dan Notestein added 5 commits

    added 5 commits

    • 0bb99780 - Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...
    • e420d4d0 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies
    • 1b0edcb7 - Retag docker images for regular builds as /hive/hivemind, removing
    • 9c4446a6 - Postgrest - restore LIMIT in queres which uses CTE and limit, appears that...
    • 25fbbf1f - Refactor get_trending_ranked_posts_for_communities to use MATERIALIZED CTE for...

    Compare with previous version

  • Dan Notestein added 7 commits

    added 7 commits

    • 25fbbf1f...e02e4640 - 3 commits from branch develop
    • 3753f7dd - Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...
    • 98ef8474 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies
    • 4732eacf - Refactor get_trending_ranked_posts_for_communities to use MATERIALIZED CTE for...
    • 30015e74 - Merge branch 'speedup-ranked-posts' of gitlab.syncad.com:hive/hivemind into speedup-ranked-posts

    Compare with previous version

  • Dan Notestein added 1 commit

    added 1 commit

    • 36c5854e - Fix formatting in get_account_posts.sql by removing merge conflict markers

    Compare with previous version

  • Dan Notestein added 1 commit

    added 1 commit

    • 462c63e0 - Revert check related to muting, it is the replies being muted, not the posts.

    Compare with previous version

  • This index speeds up get_account_posts_by_replies:

    DROP INDEX IF EXISTS hivemind_app.hive_posts_parent_id_id_idx;
    CREATE INDEX IF NOT EXISTS hive_posts_parent_id_id_idx
        ON hivemind_app.hive_posts USING btree
        (parent_id ASC NULLS LAST, id DESC NULLS FIRST)
        INCLUDE (author_id)
        TABLESPACE haf_tablespace
        WHERE counter_deleted = 0;

    The trick is it avoids having to read the hive_posts table to fetch the author_ids. There can be a lot of posts by some accounts (e.g. rentmoney had 30K+ posts).

  • Dan Notestein added 2 commits

    added 2 commits

    • 92aafa4e - Fix muted accounts check in get_account_posts.sql to reference the correct author ID
    • 29059142 - Refactor SQL scripts to improve query performance and fix account ID...

    Compare with previous version

  • Dan Notestein added 1 commit

    added 1 commit

    • 0bedb57e - Remove obsolete join replaced by subquery

    Compare with previous version

  • Dan Notestein added 1 commit

    added 1 commit

    • ca020dc0 - Fix error in get_trending_ranked_posts_for_communities

    Compare with previous version

    • One remaining known slow one is get_created_ranked_posts_for_observer_communities when the observer is not subscribed to any recent communities (or worse, not subscribed to any at all). We have an idea of how to fix it, but the first attempt didn't work, although it still seems like it should, so it needs more investigation as to why it is still slow (in the fix attempt it seems like it is still reading hive_posts when it should just use data from an index only scan). Here was fix attempt (not necessarily the last version but suggests the idea):

      ROP FUNCTION IF EXISTS hivemind_postgrest_utilities.get_created_ranked_posts_for_observer_communities;
      CREATE FUNCTION hivemind_postgrest_utilities.get_created_ranked_posts_for_observer_communities(IN _post_id INT, IN _observer_id INT, IN _limit INT)
      RETURNS JSONB
      LANGUAGE 'plpgsql'
      STABLE
      AS
      $$
      DECLARE
        _result JSONB;
      BEGIN
      
        _result = (
          SELECT jsonb_agg (
              hivemind_postgrest_utilities.create_bridge_post_object(row, 0, NULL, row.is_pinned, True)
          ) FROM ( -- get_created_ranked_posts_for_observer_communities
            WITH observer_subscriptions AS MATERIALIZED (
              SELECT
                community_id
              FROM hivemind_app.hive_subscriptions
              WHERE
                account_id = _observer_id
            ),
            max_post_id as MATERIALIZED ( --find the max post id for each subscribed community
              SELECT
                COALESCE(MAX(hp.id), 0) as max_id
              FROM hivemind_app.live_posts_view hp
              JOIN observer_subscriptions os ON hp.community_id = os.community_id
              where hp.id < _post_id OR _post_id = 0
            ),
            observer_posts as
            (
              SELECT
                hp.id,
                blacklist.source
              FROM hivemind_app.live_posts_view hp
              JOIN hivemind_app.hive_accounts_view ha ON hp.author_id = ha.id
              JOIN observer_subscriptions hs ON hs.community_id = hp.community_id
              LEFT OUTER JOIN hivemind_app.blacklisted_by_observer_view blacklist ON (blacklist.observer_id = _observer_id AND blacklist.blacklisted_id = hp.author_id)
              WHERE
                NOT ha.is_grayed 
                AND hp.id <= (SELECT max_id from max_post_id)
                AND NOT (_observer_id <> 0 AND EXISTS (SELECT 1 FROM hivemind_app.muted_accounts_by_id_view WHERE observer_id = _observer_id AND muted_id = hp.author_id))
              ORDER BY
                hp.id DESC
              LIMIT _limit
            )
            SELECT
              hp.id,
              hp.author,
              hp.parent_author,
              hp.author_rep,
              hp.root_title,
              hp.beneficiaries,
              hp.max_accepted_payout,
              hp.percent_hbd,
              hp.url,
              hp.permlink,
              hp.parent_permlink_or_category,
              hp.title,
              hp.body,
              hp.category,
              hp.depth,
              hp.promoted,
              hp.payout,
              hp.pending_payout,
              hp.payout_at,
              hp.is_paidout,
              hp.children,
              hp.votes,
              hp.created_at,
              hp.updated_at,
              hp.rshares,
              hp.abs_rshares,
              hp.json,
              hp.is_hidden,
              hp.is_grayed,
              hp.total_votes,
              hp.sc_trend,
              hp.role_title,
              hp.community_title,
              hp.role_id,
              hp.is_pinned,
              hp.curator_payout_value,
              hp.is_muted,
              observer_posts.source AS blacklists,
              hp.muted_reasons
            FROM observer_posts,
            LATERAL hivemind_app.get_post_view_by_id(observer_posts.id) hp
            ORDER BY
              hp.id DESC
            LIMIT _limit
          ) row
        );
      
        RETURN COALESCE(_result, '[]'::jsonb);
      END
      $$
      ;
      Edited by Dan Notestein
    • As noted, this is the query that is still being experimented with. In theory it should be faster, but postgres is somewhat willfully reading the hive.posts table when it should only use the index. We'll finish attempting to fix this one after I'm back from vacation.

    • Please register or sign in to reply
    • I wanna ask about using MATERIALIZED. I learnt that I shouldn't use it. I cannot paste you any example because it was some time ago, but when I had a case which I examined I found out that MATERIALIZED slightly increases execution time. In other cases which I checked this didn't have influence on execution time, so I decided to remove MATERIALIZED. I see you put them back, but now I also try to analize execution time with and without MATERIALIZED and I still cannot see significant differs in exectution time. Can you explain my why you use that?

      Edited by Andrzej Such
    • Sure, you use MATERIALIZED if you want to enforce your CTE (it will not bring stuff from outside the CTE in). So when you use materialize, you are saying you are smarter than the planner about the order to do things. That's frequently a dangerous thing to do, especially as things can change over time, unless you've checked every possibility and really thought it through.

    • Thanks

    • Please register or sign in to reply
    • @dan I guess we should merge some improvements into develop. Will you merge this MR into develop or should I merge mine ( !783 (merged) ) after applying your changes into my MR?

      BTW you have an error in pipeline and I think you wanted to do something like that in get_trending_ranked_posts_for_communities:

            community_posts as MATERIALIZED
            (
              SELECT id, author_id, sc_trend, is_pinned FROM hivemind_app.live_posts_view
              WHERE community_id = (SELECT id FROM hivemind_app.hive_communities WHERE name = _tag) 
              AND NOT is_paidout --use index hive_posts_community_id_not_is_paidout_idx
            ),
    • Go ahead and merge our changes into your MR and we can merge it. I'll keep this MR "as-is" so I can remember what I've done so far when I start back on it.

    • Fix any typos, ofc, I've mostly been testing final versions directly on hive-staging, so this MR was probably slightly behind.

    • !783 (merged) has been merged. I checked final version of every query in this MR and if I saw better efficiency I put it into my MR. I pasted examples of queries there.

    • Please register or sign in to reply
  • As a side note, we should check all the hive_post indexes for possible include of author_id (which is always needed for mute/blacklist checking) whenever that would allow a top sub query to do an "index ONLY scan".

    Edited by Dan Notestein
  • Andrzej Such mentioned in merge request !783 (merged)

    mentioned in merge request !783 (merged)

  • Most if not all of these changes were merged in via another MR I made. At some point I may come back and review if I missed anything useful, but probably anything not included wasn't needed or caused performance regressions under some conditions.

Please register or sign in to reply
Loading