Exploring the Depths of Data Hierarchies with PostgreSQL and Graph Databases

Overview

In today’s digital landscape, efficient data management is pivotal, especially when dealing with complex structures like
trees and graphs. This article takes a deep dive into the realm of hierarchical and network data management, leveraging
the robust capabilities of PostgreSQL and the interconnected nature of graph databases. We embark on a journey through a
series of experiments with PostgreSQL’s ltree extension, uncovering its prowess in handling complex hierarchical data,
and juxtapose this with the innate strengths of graph databases in managing dense networks.
Join us as we unravel the intricacies of these technologies and their practical applications.

Tree and Graph: A Comparative Study of Data Structures

Tree Structure
  • Characteristics: Trees epitomize a hierarchical model, with each node (except the root) having one clear parent, forming a unique path from the root.
  • Use Cases: Ideal for organizational charts, file systems, and taxonomies.
  • Limitations: The single parent restriction and scalability issues can hinder handling complex datasets.
Graph Structure
  • Characteristics: Graphs represent a network model without hierarchical constraints, allowing multiple and varied node relationships.
  • Use Cases: Best suited for social networks, geographic information systems, and network analysis.
  • Limitations: The complexity and resource intensity of graph operations can be challenging.

Graph Databases

Neo4j. Renowned for its robustness in cloud environments, Neo4j leads in graph database technology but limits horizontal scaling in its community edition.

Microsoft Azure Cosmos DB. Mirroring Neo4j’s capabilities, it shines in cloud-based deployment, especially for mature projects.

OrientDB. Now community-driven, OrientDB supports cluster nodes efficiently with a moderate resource requirement. 4GB is enough to run cluster node

TerminusDB, TypeDB, and others. These databases bring unique features to the table, with considerations like community support and licensing playing a crucial role in selection. But they are not so popular.

PostgreSQL for Comment Handling in Social Networks

Table Structure

To manage comments in a social network context, we can use a table structure in PostgreSQL like this:

1
2
3
4
5
6
7
8
9
CREATE TABLE comments (
id UUID PRIMARY KEY,
post_id UUID NOT NULL REFERENCES posts(id),
user_id UUID NOT NULL REFERENCES users(id),
parent_comment_id UUID REFERENCES comments(id),
content TEXT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP
);

Retrieving Comments and Their Parents

We can retrieve a comment along with all its parent comments using recursive queries. Here’s an example that limits the recursion depth to prevent infinite loops:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
WITH RECURSIVE recursive_cte AS (
SELECT
id,
parent_comment_id,
1 AS level
FROM
comments
WHERE
id = _id
UNION
SELECT
c.parent_comment_id AS id,
c.parent_comment_id,
rec.level + 1 AS level
FROM
recursive_cte rec
INNER JOIN comments c ON rec.parent_comment_id = c.id
WHERE
level < _max_recursion
)
SELECT r.id, r.parent_comment_id, r.level
FROM recursive_cte r;

Keyset Pagination for Comments

To load comments page by page:

1
2
3
SELECT * FROM comments
WHERE post_id = $post_id AND created_at < $last_created_at
ORDER BY created_at DESC

This approach uses the created_at field to paginate through comments, effectively implementing keyset pagination.

Partitioning Strategies

  • Range Partitioning: Comments can be partitioned based on the creation date (e.g., by month or year).
  • List Partitioning: Partitioning can be done based on post_id, where each partition contains comments for a subset of posts.

For identifying popular comments, a materialized view can be used:

1
2
3
4
5
6
CREATE MATERIALIZED VIEW popular_comments AS
SELECT c.*, COUNT(l.id) AS likes_count
FROM comments c
LEFT JOIN likes l ON l.comment_id = c.id
GROUP BY c.id
ORDER BY likes_count DESC;

Scaling Strategies

Vertical Scaling: Enhancing server hardware capabilities.

Horizontal Scaling: Implementing read replicas, data sharding, and using partitioning.

Optimization Techniques: Involves query optimization, efficient schema design, and appropriate indexing.

Experiment with PostgreSQL ltree

To assess PostgreSQL’s capabilities in handling and querying hierarchical data, a comprehensive test was conducted. This test involved creating a large dataset using the ltree extension in PostgreSQL, followed by executing a series of queries to evaluate performance. Four key scripts were used in this experiment, each serving a distinct purpose.

Script 1: Data Generation

The first script focused on generating a substantial volume of hierarchical data. The script created a table big_tree_test with an ltree column to effectively represent hierarchical paths. Key points of this script include:

  • Table and Index Creation: Setting up a table with ltree[] and creating a B-tree index on this column.
  • Data Insertion: Inserting a large volume of rows to simulate a complex hierarchical structure.

Performance Metrics:

  • Rows Generated: 21,510,000
  • Time Taken: Approximately 45 minutes
  • Memory Usage: About 5.859 GiB
  • CPU Utilization: Utilized only one core, indicating an area for potential optimization.

Shortened version of script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
'select floor(random()* (_max - _min + 1) + _min)::int;';
insert into big_tree_test
select
t1.id as id,
null as parent_id,
'#' || t1.id || ' root' || t1.r
from (
select id, pg_temp.random_int(1, _root_rows) r
from generate_series(1, _root_rows) id
) t1;
-- after that we can create comments with `parent_id` from generated ones
insert into big_tree_test
select
t1.id as id,
p as parent_id,
' l' || i || ' ' || t1.id || ' sub' || t1.r
from (
select _root_rows + _rows_per_group * 1 + id as id,
pg_temp.random_int(1, _root_rows + _rows_per_group * 1) p, -- random parent_id
pg_temp.random_int(1, _rows_per_group) r -- just random number
from generate_series(1, _rows_per_group) id
) t1;

Full sql with generating ltree path:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
do
language plpgsql
$$
declare
_root_rows int = 1500000;
_rows_per_group int = 200;
_max_group int = 100000;
_very_deep int = 10000;
begin
drop table if exists big_tree_test;
CREATE TABLE big_tree_test (
id serial PRIMARY KEY,
parent_id int NULL,
content text NOT NULL,
permissions ltree[]
);
create or replace function pg_temp.random_int(_min int, _max int) returns int language sql as
'select floor(random()* (_max - _min + 1) + _min)::int;';
insert into big_tree_test
select
t1.id as id,
null as parent_id,
'#' || t1.id || ' root' || t1.r,
cast((concat('{"n', t1.id,'"}')) as ltree[]) as permissions
from (
select id, pg_temp.random_int(1, _root_rows) r
from generate_series(1, _root_rows) id
) t1;
-- the most readable is to use already created *page* as parents
-- after that use 2 pages, 3, ... else we would have null as parent_id
FOR i IN 0..(_max_group - 1) LOOP
insert into big_tree_test
select
t1.id as id,
p as parent_id,
' l' || i || ' ' || t1.id || ' sub' || t1.r,
cast((
SELECT
array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
FROM big_tree_test ss
WHERE id = t1.p
) as ltree[]) AS permissions
from (
select _root_rows + _rows_per_group * i + id as id,
pg_temp.random_int(1, _root_rows + _rows_per_group * i) p,
pg_temp.random_int(1, _rows_per_group) r
from generate_series(1, _rows_per_group) id
) t1;
END LOOP;
-- to create very deep comments we have to use parent_id of the last created
FOR i IN 1..(_very_deep) LOOP
insert into big_tree_test
select
t1.id as id,
p as parent_id,
' dd' || i || ' ' || t1.id || ' sub' || t1.r,
cast((
SELECT
array_append('{}'::ltree[], concat(ss.permissions[1], '.s_', i, '_', t1.id)::ltree)
FROM big_tree_test ss
WHERE id = t1.p
) as ltree[]) AS permissions
from (
select _root_rows + _rows_per_group * _max_group + i as id,
pg_temp.random_int(_root_rows + _rows_per_group * _max_group + i * 9 / 10, _root_rows + _rows_per_group * _max_group + i - 1) p,
pg_temp.random_int(1, _rows_per_group) r
) t1;
END LOOP;
end;
$$;

Indexes

1
2
3
4
-- bad one. it may be used for full eq
CREATE INDEX idx_big_tree_test_permissions ON big_tree_test USING btree(permissions);
-- good one
CREATE INDEX idx_big_tree_gist_permissions ON big_tree_test USING gist(permissions);

Size of data with indexes: 9,9G
Without btree: 6,4G
And without gist: 4,9G
Create gist:

  • time for : 6m 37s
    • CPU: 1
    • RAM: up to 2.2G

Script 2: Full Scan Query

This script performed a full table scan using the nlevel function on the permissions column.

Performance Metrics:

  • Execution Time: 1 minute and 11 seconds
  • This time reflects the duration for a full scan without the benefit of an optimized index for the operation.
1
2
3
4
5
SELECT
*,
nlevel(t1.permissions[1]) as level
FROM big_tree_test t1
ORDER BY level DESC

here and everywhere in the article: psql (16.1 (Debian 16.1-1.pgdg120+1))

1
2
3
4
5
6
7
8
Gather Merge (cost=3422605.51..5514001.53 rows=17925000 width=136)
Workers Planned: 2
-> Sort (cost=3421605.49..3444011.74 rows=8962500 width=136)
Sort Key: (nlevel(permissions[1])) DESC
-> Parallel Seq Scan on big_tree_test t1 (cost=0.00..548625.25 rows=8962500 width=136)
JIT:
Functions: 2
Options: Inlining true, Optimization true, Expressions true, Deforming true

Script 3: Select Query with Pattern Matching

The third script executed a query using the ~ operator for pattern matching against the permissions column.

Performance Metrics:

  • Rows Fetched: 10,000
  • First Run Time: 19 seconds
  • Subsequent Run Time: 5 seconds
  • The improved performance on the second run demonstrates PostgreSQL’s efficiency in caching and query optimization.
1
2
3
4
SELECT
COUNT(*)
FROM big_tree_test t1
WHERE '*.s_16340_4768122.*' ~ t1.permissions
1
2
3
4
5
Aggregate (cost=8265.99..8266.00 rows=1 width=8)
-> Bitmap Heap Scan on big_tree_test t1 (cost=101.09..8260.61 rows=2151 width=0)
Recheck Cond: ('*.s_16340_4768122.*'::lquery ~ permissions)
-> Bitmap Index Scan on idx_big_tree_gist_permissions (cost=0.00..100.55 rows=2151 width=0)
Index Cond: (permissions ~ '*.s_16340_4768122.*'::lquery)

https://explain.dalibo.com/plan/c6bg7g1f2cbacgd3
image
image

  1. Recheck Cond (‘.s_16340_4768122.‘::lquery ~ t1.permissions): This indicates that during the bitmap heap scan, the database needed to recheck the rows fetched by the index scan against the condition *.s_16340_4768122.*'::lquery ~ t1.permissions. This recheck is necessary because the bitmap index scan is lossy—it might fetch rows that seem to match based on the index but actually do not meet the condition upon rechecking.

  2. Rows Removed by Index Recheck (9,602,264): This is the number of rows that were initially identified by the index scan as potential matches but were later discarded upon rechecking the condition. It shows that the initial index scan fetched a lot of rows that did not actually meet the query condition.

https://wiki.postgresql.org/wiki/Bitmap_Indexes#General

Script 4: Select Query with Tree Operator

The final script used the @> operator to fetch specific hierarchical data.

Performance Metrics:

  • Rows Fetched: 10,000
  • Execution Time: Approximately 200 milliseconds
  • This query’s speed showcases the effectiveness of GiST indexes in handling ltree operations.
1
2
3
4
SELECT
*
FROM big_tree_test t1
WHERE 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000' @> t1.permissions
1
2
3
4
Bitmap Heap Scan on big_tree_test t1 (cost=101.09..8260.61 rows=2151 width=132)
Recheck Cond: ('n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree @> permissions)
-> Bitmap Index Scan on idx_big_tree_gist_permissions (cost=0.00..100.55 rows=2151 width=0)
Index Cond: (permissions <@ 'n1258534.s_13709_4241826.s_16340_4768122.s_99999_21500000'::ltree)

https://explain.dalibo.com/plan/5g3a7g5eb172gb04
image