2023-12-06

Exploring the Depths of Data Hierarchies with PostgreSQL and Graph Databases

Overview

In today’s digital landscape, efficient data management is pivotal, especially when dealing with complex structures like
trees and graphs. This article takes a deep dive into the realm of hierarchical and network data management, leveraging
the robust capabilities of PostgreSQL and the interconnected nature of graph databases. We embark on a journey through a
series of experiments with PostgreSQL’s ltree extension, uncovering its prowess in handling complex hierarchical data,
and juxtapose this with the innate strengths of graph databases in managing dense networks.
Join us as we unravel the intricacies of these technologies and their practical applications.

Tree and Graph: A Comparative Study of Data Structures

Tree Structure

Characteristics: Trees epitomize a hierarchical model, with each node (except the root) having one clear parent, forming a unique path from the root.
Use Cases: Ideal for organizational charts, file systems, and taxonomies.
Limitations: The single parent restriction and scalability issues can hinder handling complex datasets.

Graph Structure

Characteristics: Graphs represent a network model without hierarchical constraints, allowing multiple and varied node relationships.
Use Cases: Best suited for social networks, geographic information systems, and network analysis.
Limitations: The complexity and resource intensity of graph operations can be challenging.

Graph Databases

Neo4j. Renowned for its robustness in cloud environments, Neo4j leads in graph database technology but limits horizontal scaling in its community edition.

Microsoft Azure Cosmos DB. Mirroring Neo4j’s capabilities, it shines in cloud-based deployment, especially for mature projects.

OrientDB. Now community-driven, OrientDB supports cluster nodes efficiently with a moderate resource requirement. 4GB is enough to run cluster node

TerminusDB, TypeDB, and others. These databases bring unique features to the table, with considerations like community support and licensing playing a crucial role in selection. But they are not so popular.

Table Structure

To manage comments in a social network context, we can use a table structure in PostgreSQL like this:

CREATE TABLE comments (
    id UUID PRIMARY KEY,
    post_id UUID NOT NULL REFERENCES posts(id),
    user_id UUID NOT NULL REFERENCES users(id),
    parent_comment_id UUID REFERENCES comments(id),
    content TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMP
);

Retrieving Comments and Their Parents

We can retrieve a comment along with all its parent comments using recursive queries. Here’s an example that limits the recursion depth to prevent infinite loops:

WITH RECURSIVE recursive_cte AS (
    SELECT
        id,
        parent_comment_id,
        1 AS level
    FROM
        comments
    WHERE 
        id = _id
    
    UNION 
    SELECT
        c.parent_comment_id AS id,
        c.parent_comment_id,
        rec.level + 1 AS level
    FROM
        recursive_cte rec
        INNER JOIN comments c ON rec.parent_comment_id = c.id
    WHERE
        level < _max_recursion 
)
SELECT r.id, r.parent_comment_id, r.level
FROM recursive_cte r;

To load comments page by page:

1
2
3

SELECT * FROM comments
WHERE post_id = $post_id AND created_at < $last_created_at
ORDER BY created_at DESC

This approach uses the created_at field to paginate through comments, effectively implementing keyset pagination.

Partitioning Strategies

Range Partitioning: Comments can be partitioned based on the creation date (e.g., by month or year).
List Partitioning: Partitioning can be done based on post_id, where each partition contains comments for a subset of posts.

Materialized Path for Popular Comments

For identifying popular comments, a materialized view can be used:

CREATE MATERIALIZED VIEW popular_comments AS
SELECT c.*, COUNT(l.id) AS likes_count
FROM comments c
LEFT JOIN likes l ON l.comment_id = c.id
GROUP BY c.id
ORDER BY likes_count DESC;

Scaling Strategies

Vertical Scaling: Enhancing server hardware capabilities.

Horizontal Scaling: Implementing read replicas, data sharding, and using partitioning.

Optimization Techniques: Involves query optimization, efficient schema design, and appropriate indexing.

Experiment with PostgreSQL `ltree`

To assess PostgreSQL’s capabilities in handling and querying hierarchical data, a comprehensive test was conducted. This test involved creating a large dataset using the ltree extension in PostgreSQL, followed by executing a series of queries to evaluate performance. Four key scripts were used in this experiment, each serving a distinct purpose.

Script 1: Data Generation

The first script focused on generating a substantial volume of hierarchical data. The script created a table big_tree_test with an ltree column to effectively represent hierarchical paths. Key points of this script include:

Table and Index Creation: Setting up a table with ltree[] and creating a B-tree index on this column.
Data Insertion: Inserting a large volume of rows to simulate a complex hierarchical structure.

Performance Metrics:

Rows Generated: 21,510,000
Time Taken: Approximately 45 minutes
Memory Usage: About 5.859 GiB
CPU Utilization: Utilized only one core, indicating an area for potential optimization.

Shortened version of script:

create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
'select floor(random()* (_max - _min + 1) + _min)::int;';
insert into big_tree_test
select
    t1.id as id,
    null as parent_id,
    '#' || t1.id || ' root' || t1.r
from (
     select id, pg_temp.random_int(1, _root_rows) r
     from generate_series(1, _root_rows) id
 ) t1;
-- after that we can create comments with `parent_id` from generated ones
insert into big_tree_test
select
    t1.id as id,
    p as parent_id,
    ' l' || i || ' ' || t1.id || ' sub' || t1.r
from (
     select _root_rows + _rows_per_group * 1 + id as id,
            pg_temp.random_int(1, _root_rows + _rows_per_group * 1) p, -- random parent_id
            pg_temp.random_int(1, _rows_per_group) r -- just random number
     from generate_series(1, _rows_per_group) id
 ) t1;

Full sql with generating ltree path:

do
language plpgsql
$$
    declare
        _root_rows int = 1500000;
        _rows_per_group int = 200;
        _max_group int = 100000;
        _very_deep int = 10000;
    begin
        drop table if exists big_tree_test;
        CREATE TABLE big_tree_test (
                                       id serial PRIMARY KEY,
                                       parent_id int NULL,
                                       content text NOT NULL,
                                       permissions ltree[]
        );
        create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
        'select floor(random()* (_max - _min + 1) + _min)::int;';
        insert into big_tree_test
        select
                t1.id as id,
                null as parent_id,
                '#' || t1.id || ' root' || t1.r,
                cast((concat('{"n', t1.id,'"}')) as ltree[]) as permissions
        from (
                 select id, pg_temp.random_int(1, _root_rows) r
                 from generate_series(1, _root_rows) id
             ) t1;
        -- the most readable is to use already created *page* as parents
        -- after that use 2 pages, 3, ... else we would have null as parent_id
        FOR i IN 0..(_max_group - 1) LOOP
                insert into big_tree_test
                select
                        t1.id as id,
                        p as parent_id,
                        ' l' || i || ' ' || t1.id || ' sub' || t1.r,
                        cast((
                            SELECT
                                array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
                            FROM big_tree_test ss
                            WHERE id = t1.p
                        ) as ltree[]) AS permissions
                from (
                         select _root_rows + _rows_per_group * i + id as id,
                                pg_temp.random_int(1, _root_rows + _rows_per_group * i) p,
                                pg_temp.random_int(1, _rows_per_group) r
                         from generate_series(1, _rows_per_group) id
                     ) t1;
        END LOOP;
        -- to create very deep comments we have to use parent_id of the last created
        FOR i IN 1..(_very_deep) LOOP
                insert into big_tree_test
                select
                        t1.id as id,
                        p as parent_id,
                        ' dd' || i || ' ' || t1.id || ' sub' || t1.r,
                        cast((
                            SELECT
                                array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
                            FROM big_tree_test ss
                            WHERE id = t1.p
                        ) as ltree[]) AS permissions
                from (
                         select _root_rows + _rows_per_group * _max_group + i as id,
                                pg_temp.random_int(_root_rows + _rows_per_group * _max_group + i * 9 / 10, _root_rows + _rows_per_group * _max_group + i - 1) p,
                                pg_temp.random_int(1, _rows_per_group) r
                     ) t1;
        END LOOP;
    end;
$$;

Indexes

-- bad one. it may be used for full eq
CREATE INDEX idx_big_tree_test_permissions ON big_tree_test USING btree(permissions);
-- good one
CREATE INDEX idx_big_tree_gist_permissions ON big_tree_test USING gist(permissions);

Size of data with indexes: 9,9G
Without btree: 6,4G
And without gist: 4,9G
Create gist:

time for : 6m 37s
- CPU: 1
- RAM: up to 2.2G

Script 2: Full Scan Query

This script performed a full table scan using the nlevel function on the permissions column.

Performance Metrics:

Execution Time: 1 minute and 11 seconds
This time reflects the duration for a full scan without the benefit of an optimized index for the operation.

SELECT
    *,
    nlevel(t1.permissions[1]) as level
FROM big_tree_test t1
ORDER BY level DESC

here and everywhere in the article: psql (16.1 (Debian 16.1-1.pgdg120+1))

Gather Merge  (cost=3422605.51..5514001.53 rows=17925000 width=136)
  Workers Planned: 2
  ->  Sort  (cost=3421605.49..3444011.74 rows=8962500 width=136)
        Sort Key: (nlevel(permissions[1])) DESC
        ->  Parallel Seq Scan on big_tree_test t1  (cost=0.00..548625.25 rows=8962500 width=136)
JIT:
  Functions: 2
  Options: Inlining true, Optimization true, Expressions true, Deforming true

Script 3: Select Query with Pattern Matching

The third script executed a query using the ~ operator for pattern matching against the permissions column.

Performance Metrics:

Rows Fetched: 10,000
First Run Time: 19 seconds
Subsequent Run Time: 5 seconds
The improved performance on the second run demonstrates PostgreSQL’s efficiency in caching and query optimization.

SELECT
    COUNT(*)
FROM big_tree_test t1
WHERE '*.s_16340_4768122.*' ~ t1.permissions

Aggregate  (cost=8265.99..8266.00 rows=1 width=8)
  ->  Bitmap Heap Scan on big_tree_test t1  (cost=101.09..8260.61 rows=2151 width=0)
        Recheck Cond: ('*.s_16340_4768122.*'::lquery ~ permissions)
        ->  Bitmap Index Scan on idx_big_tree_gist_permissions  (cost=0.00..100.55 rows=2151 width=0)
              Index Cond: (permissions ~ '*.s_16340_4768122.*'::lquery)

https://explain.dalibo.com/plan/c6bg7g1f2cbacgd3

Recheck Cond (‘.s_16340_4768122.‘::lquery ~ t1.permissions): This indicates that during the bitmap heap scan, the database needed to recheck the rows fetched by the index scan against the condition *.s_16340_4768122.*'::lquery ~ t1.permissions. This recheck is necessary because the bitmap index scan is lossy—it might fetch rows that seem to match based on the index but actually do not meet the condition upon rechecking.
Rows Removed by Index Recheck (9,602,264): This is the number of rows that were initially identified by the index scan as potential matches but were later discarded upon rechecking the condition. It shows that the initial index scan fetched a lot of rows that did not actually meet the query condition.

https://wiki.postgresql.org/wiki/Bitmap_Indexes#General

Script 4: Select Query with Tree Operator

The final script used the @> operator to fetch specific hierarchical data.

Performance Metrics:

Rows Fetched: 10,000
Execution Time: Approximately 200 milliseconds
This query’s speed showcases the effectiveness of GiST indexes in handling ltree operations.

SELECT
    *
FROM big_tree_test t1
WHERE 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000' @> t1.permissions

Bitmap Heap Scan on big_tree_test t1  (cost=101.09..8260.61 rows=2151 width=132)
  Recheck Cond: ('n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree @> permissions)
  ->  Bitmap Index Scan on idx_big_tree_gist_permissions  (cost=0.00..100.55 rows=2151 width=0)
        Index Cond: (permissions <@ 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree)

https://explain.dalibo.com/plan/5g3a7g5eb172gb04

Overview

Tree and Graph: A Comparative Study of Data Structures

Tree Structure

Graph Structure

Graph Databases

PostgreSQL for Comment Handling in Social Networks

Table Structure

Retrieving Comments and Their Parents

Keyset Pagination for Comments

Partitioning Strategies

Materialized Path for Popular Comments

Scaling Strategies

Experiment with PostgreSQL ltree

Script 1: Data Generation

Indexes

Script 2: Full Scan Query

Script 3: Select Query with Pattern Matching

Script 4: Select Query with Tree Operator

Experiment with PostgreSQL `ltree`