Draft: Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...

mentioned in issue #260 (closed)

added 1 commit

8d5b2ac9 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies

Compare with previous version

added 4 commits

8d5b2ac9...55f55884 - 2 commits from branch develop
4340b63c - Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...
c6530c18 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies

Compare with previous version

added 1 commit

8b72422c - Refactor get_trending_ranked_posts_for_communities to use MATERIALIZED CTE for...

Compare with previous version

added 5 commits

0bb99780 - Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...
e420d4d0 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies
1b0edcb7 - Retag docker images for regular builds as /hive/hivemind, removing
9c4446a6 - Postgrest - restore LIMIT in queres which uses CTE and limit, appears that...
25fbbf1f - Refactor get_trending_ranked_posts_for_communities to use MATERIALIZED CTE for...

Compare with previous version

added 7 commits

25fbbf1f...e02e4640 - 3 commits from branch develop
3753f7dd - Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...
98ef8474 - Add faster observer check for muted accounts and refactor CTE in get_account_posts_by_replies
4732eacf - Refactor get_trending_ranked_posts_for_communities to use MATERIALIZED CTE for...
30015e74 - Merge branch 'speedup-ranked-posts' of gitlab.syncad.com:hive/hivemind into speedup-ranked-posts

Compare with previous version

added 1 commit

36c5854e - Fix formatting in get_account_posts.sql by removing merge conflict markers

Compare with previous version

added 1 commit

462c63e0 - Revert check related to muting, it is the replies being muted, not the posts.

Compare with previous version

This index speeds up get_account_posts_by_replies:

DROP INDEX IF EXISTS hivemind_app.hive_posts_parent_id_id_idx;
CREATE INDEX IF NOT EXISTS hive_posts_parent_id_id_idx
    ON hivemind_app.hive_posts USING btree
    (parent_id ASC NULLS LAST, id DESC NULLS FIRST)
    INCLUDE (author_id)
    TABLESPACE haf_tablespace
    WHERE counter_deleted = 0;

The trick is it avoids having to read the hive_posts table to fetch the author_ids. There can be a lot of posts by some accounts (e.g. rentmoney had 30K+ posts).

added 2 commits

92aafa4e - Fix muted accounts check in get_account_posts.sql to reference the correct author ID
29059142 - Refactor SQL scripts to improve query performance and fix account ID...

Compare with previous version

added 1 commit

0bedb57e - Remove obsolete join replaced by subquery

Compare with previous version

added 1 commit

ca020dc0 - Fix error in get_trending_ranked_posts_for_communities

Compare with previous version

One remaining known slow one is get_created_ranked_posts_for_observer_communities when the observer is not subscribed to any recent communities (or worse, not subscribed to any at all). We have an idea of how to fix it, but the first attempt didn't work, although it still seems like it should, so it needs more investigation as to why it is still slow (in the fix attempt it seems like it is still reading hive_posts when it should just use data from an index only scan). Here was fix attempt (not necessarily the last version but suggests the idea):

ROP FUNCTION IF EXISTS hivemind_postgrest_utilities.get_created_ranked_posts_for_observer_communities;
CREATE FUNCTION hivemind_postgrest_utilities.get_created_ranked_posts_for_observer_communities(IN _post_id INT, IN _observer_id INT, IN _limit INT)
RETURNS JSONB
LANGUAGE 'plpgsql'
STABLE
AS
$$
DECLARE
  _result JSONB;
BEGIN

  _result = (
    SELECT jsonb_agg (
        hivemind_postgrest_utilities.create_bridge_post_object(row, 0, NULL, row.is_pinned, True)
    ) FROM ( -- get_created_ranked_posts_for_observer_communities
      WITH observer_subscriptions AS MATERIALIZED (
        SELECT
          community_id
        FROM hivemind_app.hive_subscriptions
        WHERE
          account_id = _observer_id
      ),
      max_post_id as MATERIALIZED ( --find the max post id for each subscribed community
        SELECT
          COALESCE(MAX(hp.id), 0) as max_id
        FROM hivemind_app.live_posts_view hp
        JOIN observer_subscriptions os ON hp.community_id = os.community_id
        where hp.id < _post_id OR _post_id = 0
      ),
      observer_posts as
      (
        SELECT
          hp.id,
          blacklist.source
        FROM hivemind_app.live_posts_view hp
        JOIN hivemind_app.hive_accounts_view ha ON hp.author_id = ha.id
        JOIN observer_subscriptions hs ON hs.community_id = hp.community_id
        LEFT OUTER JOIN hivemind_app.blacklisted_by_observer_view blacklist ON (blacklist.observer_id = _observer_id AND blacklist.blacklisted_id = hp.author_id)
        WHERE
          NOT ha.is_grayed 
          AND hp.id <= (SELECT max_id from max_post_id)
          AND NOT (_observer_id <> 0 AND EXISTS (SELECT 1 FROM hivemind_app.muted_accounts_by_id_view WHERE observer_id = _observer_id AND muted_id = hp.author_id))
        ORDER BY
          hp.id DESC
        LIMIT _limit
      )
      SELECT
        hp.id,
        hp.author,
        hp.parent_author,
        hp.author_rep,
        hp.root_title,
        hp.beneficiaries,
        hp.max_accepted_payout,
        hp.percent_hbd,
        hp.url,
        hp.permlink,
        hp.parent_permlink_or_category,
        hp.title,
        hp.body,
        hp.category,
        hp.depth,
        hp.promoted,
        hp.payout,
        hp.pending_payout,
        hp.payout_at,
        hp.is_paidout,
        hp.children,
        hp.votes,
        hp.created_at,
        hp.updated_at,
        hp.rshares,
        hp.abs_rshares,
        hp.json,
        hp.is_hidden,
        hp.is_grayed,
        hp.total_votes,
        hp.sc_trend,
        hp.role_title,
        hp.community_title,
        hp.role_id,
        hp.is_pinned,
        hp.curator_payout_value,
        hp.is_muted,
        observer_posts.source AS blacklists,
        hp.muted_reasons
      FROM observer_posts,
      LATERAL hivemind_app.get_post_view_by_id(observer_posts.id) hp
      ORDER BY
        hp.id DESC
      LIMIT _limit
    ) row
  );

  RETURN COALESCE(_result, '[]'::jsonb);
END
$$
;

!783 (comment 188921)

As noted, this is the query that is still being experimented with. In theory it should be faster, but postgres is somewhat willfully reading the hive.posts table when it should only use the index. We'll finish attempting to fix this one after I'm back from vacation.

I wanna ask about using MATERIALIZED. I learnt that I shouldn't use it. I cannot paste you any example because it was some time ago, but when I had a case which I examined I found out that MATERIALIZED slightly increases execution time. In other cases which I checked this didn't have influence on execution time, so I decided to remove MATERIALIZED. I see you put them back, but now I also try to analize execution time with and without MATERIALIZED and I still cannot see significant differs in exectution time. Can you explain my why you use that?

Sure, you use MATERIALIZED if you want to enforce your CTE (it will not bring stuff from outside the CTE in). So when you use materialize, you are saying you are smarter than the planner about the order to do things. That's frequently a dangerous thing to do, especially as things can change over time, unless you've checked every possibility and really thought it through.

Thanks

@dan I guess we should merge some improvements into develop. Will you merge this MR into develop or should I merge mine ( !783 (merged) ) after applying your changes into my MR?

BTW you have an error in pipeline and I think you wanted to do something like that in get_trending_ranked_posts_for_communities:

      community_posts as MATERIALIZED
      (
        SELECT id, author_id, sc_trend, is_pinned FROM hivemind_app.live_posts_view
        WHERE community_id = (SELECT id FROM hivemind_app.hive_communities WHERE name = _tag) 
        AND NOT is_paidout --use index hive_posts_community_id_not_is_paidout_idx
      ),

Go ahead and merge our changes into your MR and we can merge it. I'll keep this MR "as-is" so I can remember what I've done so far when I start back on it.

Fix any typos, ofc, I've mostly been testing final versions directly on hive-staging, so this MR was probably slightly behind.

!783 (merged) has been merged. I checked final version of every query in this MR and if I saw better efficiency I put it into my MR. I pasted examples of queries there.

As a side note, we should check all the hive_post indexes for possible include of author_id (which is always needed for mute/blacklist checking) whenever that would allow a top sub query to do an "index ONLY scan".

mentioned in merge request !783 (merged)

Most if not all of these changes were merged in via another MR I made. At some point I may come back and review if I missed anything useful, but probably anything not included wasn't needed or caused performance regressions under some conditions.

closed

Draft: Refactor get_ranked_posts.sql to add a CTE to ensure posts get filtered by...

Merge request reports

Activity