Overview
In today’s digital landscape, efficient data management is pivotal, especially when dealing with complex structures like
trees and graphs. This article takes a deep dive into the realm of hierarchical and network data management, leveraging
the robust capabilities of PostgreSQL and the interconnected nature of graph databases. We embark on a journey through a
series of experiments with PostgreSQL’s ltree extension, uncovering its prowess in handling complex hierarchical data,
and juxtapose this with the innate strengths of graph databases in managing dense networks.
Join us as we unravel the intricacies of these technologies and their practical applications.
Tree and Graph: A Comparative Study of Data Structures
Tree Structure
- Characteristics: Trees epitomize a hierarchical model, with each node (except the root) having one clear parent, forming a unique path from the root.
- Use Cases: Ideal for organizational charts, file systems, and taxonomies.
- Limitations: The single parent restriction and scalability issues can hinder handling complex datasets.
Graph Structure
- Characteristics: Graphs represent a network model without hierarchical constraints, allowing multiple and varied node relationships.
- Use Cases: Best suited for social networks, geographic information systems, and network analysis.
- Limitations: The complexity and resource intensity of graph operations can be challenging.
Graph Databases
Neo4j. Renowned for its robustness in cloud environments, Neo4j leads in graph database technology but limits horizontal scaling in its community edition.
Microsoft Azure Cosmos DB. Mirroring Neo4j’s capabilities, it shines in cloud-based deployment, especially for mature projects.
OrientDB. Now community-driven, OrientDB supports cluster nodes efficiently with moderate resource requirements; 4 GB of RAM is enough to run a cluster node.
TerminusDB, TypeDB, and others. These databases bring unique features to the table, with considerations like community support and licensing playing a crucial role in selection, but they are far less widely adopted.
PostgreSQL for Comment Handling in Social Networks
Table Structure
To manage comments in a social network context, we can use a table structure in PostgreSQL like this:
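The original article does not show the exact DDL, so the following is a minimal sketch; the table and column names (comments, post_id, parent_id, created_at) are assumptions based on the queries discussed later:

```sql
-- Hypothetical comments table: each comment optionally points at a parent,
-- forming a tree per post.
CREATE TABLE comments (
    id         BIGSERIAL PRIMARY KEY,
    post_id    BIGINT NOT NULL,
    parent_id  BIGINT REFERENCES comments (id),
    author_id  BIGINT NOT NULL,
    body       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Supports keyset pagination per post and fast parent lookups.
CREATE INDEX comments_post_created_idx ON comments (post_id, created_at, id);
CREATE INDEX comments_parent_idx ON comments (parent_id);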
Retrieving Comments and Their Parents
We can retrieve a comment along with all its parent comments using recursive queries. Here’s an example that limits the recursion depth to prevent infinite loops:
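A sketch of such a recursive query, assuming the hypothetical comments table above; the starting comment id (42) and the depth cap (100) are placeholders:

```sql
-- Walk from one comment up to the root, capping recursion depth.
WITH RECURSIVE thread AS (
    SELECT c.*, 0 AS depth
    FROM comments c
    WHERE c.id = 42                -- starting comment (placeholder id)

    UNION ALL

    SELECT p.*, t.depth + 1
    FROM comments p
    JOIN thread t ON p.id = t.parent_id
    WHERE t.depth < 100            -- guard against cycles / runaway recursion
)
SELECT * FROM thread;
```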
Keyset Pagination for Comments
To load comments page by page:
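A hedged sketch of the pagination queries; the tuple comparison on (created_at, id) is one common way to make the keyset stable when timestamps collide, though the article only mentions created_at:

```sql
-- First page for a post:
SELECT id, body, created_at
FROM comments
WHERE post_id = 7
ORDER BY created_at, id
LIMIT 20;

-- Next page: pass the (created_at, id) of the last row from the previous page.
SELECT id, body, created_at
FROM comments
WHERE post_id = 7
  AND (created_at, id) > ('2024-01-15 10:00:00+00', 123)
ORDER BY created_at, id
LIMIT 20;
```

Unlike OFFSET pagination, the cost of each page stays constant because the index is entered directly at the last-seen key.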
This approach uses the created_at field to paginate through comments, effectively implementing keyset pagination.
Partitioning Strategies
- Range Partitioning: Comments can be partitioned based on the creation date (e.g., by month or year).
- List Partitioning: Partitioning can be done based on post_id, where each partition contains comments for a subset of posts.
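A minimal sketch of the range-partitioning option; the table name comments_by_month is hypothetical, and the primary key includes created_at because PostgreSQL requires the partition key in any unique constraint:

```sql
CREATE TABLE comments_by_month (
    id         BIGSERIAL,
    post_id    BIGINT NOT NULL,
    body       TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- One partition per month; old months can be detached or dropped cheaply.
CREATE TABLE comments_2024_01 PARTITION OF comments_by_month
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
```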
Materialized Views for Popular Comments
For identifying popular comments, a materialized view can be used:
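The article does not show the view definition, so this is a sketch under the assumption that likes live in a separate comment_likes table; the threshold of 100 likes is a placeholder:

```sql
-- Precompute the heavy aggregation once, then serve reads from the view.
CREATE MATERIALIZED VIEW popular_comments AS
SELECT c.id, c.post_id, c.body, count(*) AS like_count
FROM comments c
JOIN comment_likes l ON l.comment_id = c.id
GROUP BY c.id, c.post_id, c.body
HAVING count(*) >= 100;

-- Refresh on a schedule; a plain REFRESH locks out readers briefly,
-- while CONCURRENTLY requires a unique index on the view.
REFRESH MATERIALIZED VIEW popular_comments;
```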
Scaling Strategies
- Vertical Scaling: Enhancing server hardware capabilities.
- Horizontal Scaling: Implementing read replicas, data sharding, and using partitioning.
- Optimization Techniques: Involves query optimization, efficient schema design, and appropriate indexing.
Experiment with PostgreSQL ltree
To assess PostgreSQL’s capabilities in handling and querying hierarchical data, a comprehensive test was conducted. This test involved creating a large dataset using the ltree
extension in PostgreSQL, followed by executing a series of queries to evaluate performance. Four key scripts were used in this experiment, each serving a distinct purpose.
Script 1: Data Generation
The first script focused on generating a substantial volume of hierarchical data. It created a table big_tree_test with an ltree column to effectively represent hierarchical paths. Key points of this script include:
- Table and Index Creation: Setting up a table with an ltree column and creating a B-tree index on it.
- Data Insertion: Inserting a large volume of rows to simulate a complex hierarchical structure.
Performance Metrics:
- Rows Generated: 21,510,000
- Time Taken: Approximately 45 minutes
- Memory Usage: About 5.859 GiB
- CPU Utilization: Utilized only one core, indicating an area for potential optimization.
A shortened version of the script:
The full SQL, including generation of the ltree paths:
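The original scripts were not preserved, so the following is only a rough sketch of the shape such a generator could take; the path format ('root.s_N.s_M') and the row count are illustrative, not the author's actual values (the real run produced 21,510,000 rows):

```sql
CREATE EXTENSION IF NOT EXISTS ltree;

CREATE TABLE big_tree_test (
    id          BIGSERIAL PRIMARY KEY,
    permissions ltree NOT NULL
);

-- B-tree index on the ltree column, as described in Script 1.
CREATE INDEX big_tree_test_permissions_btree
    ON big_tree_test (permissions);

-- Synthetic hierarchical paths such as 'root.s_12.s_345'.
INSERT INTO big_tree_test (permissions)
SELECT text2ltree('root.s_' || g / 1000 || '.s_' || g % 1000)
FROM generate_series(0, 999999) AS g;
```

Since the real script used a single INSERT ... SELECT, it ran on one CPU core, which matches the observation above and suggests parallelizing the load in batches as an optimization.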
Indexes
- Size of data with all indexes: 9.9 GB
- Without the B-tree index: 6.4 GB
- Without the GiST index: 4.9 GB
Creating the GiST index:
- Time: 6m 37s
- CPU: 1 core
- RAM: up to 2.2 GB
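The statement itself is not shown in the article; assuming the table and column from Script 1, it would look roughly like this:

```sql
-- GiST index enabling fast ltree operators such as @>, <@ and ~.
CREATE INDEX big_tree_test_permissions_gist
    ON big_tree_test USING gist (permissions);
```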
Script 2: Full Scan Query
This script performed a full table scan using the nlevel function on the permissions column.
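The exact query is not preserved; a plausible sketch, with the depth threshold chosen arbitrarily, is:

```sql
-- nlevel() returns the number of labels in a path; neither index helps
-- here, so PostgreSQL must scan every row.
SELECT count(*)
FROM big_tree_test
WHERE nlevel(permissions) > 3;
```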
Performance Metrics:
- Execution Time: 1 minute and 11 seconds
- This time reflects the duration for a full scan without the benefit of an optimized index for the operation.
PostgreSQL version used here and everywhere in the article: psql (16.1 (Debian 16.1-1.pgdg120+1))
Script 3: Select Query with Pattern Matching
The third script executed a query using the ~ operator for pattern matching against the permissions column.
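The query body was lost in extraction; reconstructing it from the lquery pattern visible in the EXPLAIN output quoted below, it was presumably along these lines:

```sql
-- lquery pattern match: find paths containing the label s_16340_4768122
-- at any depth. The pattern is taken from the article's EXPLAIN output.
SELECT *
FROM big_tree_test t1
WHERE t1.permissions ~ '*.s_16340_4768122.*'::lquery
LIMIT 10000;
```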
Performance Metrics:
- Rows Fetched: 10,000
- First Run Time: 19 seconds
- Subsequent Run Time: 5 seconds
- The improved performance on the second run demonstrates PostgreSQL’s efficiency in caching and query optimization.
https://explain.dalibo.com/plan/c6bg7g1f2cbacgd3
- Recheck Cond ('*.s_16340_4768122.*'::lquery ~ t1.permissions): during the bitmap heap scan, the database had to recheck the rows fetched by the index scan against this condition. The recheck is necessary because the bitmap index scan is lossy: it may fetch rows that appear to match based on the index but do not actually satisfy the condition.
- Rows Removed by Index Recheck (9,602,264): the number of rows initially identified by the index scan as potential matches but discarded upon rechecking the condition. The initial index scan fetched a large number of rows that did not actually meet the query condition.
https://wiki.postgresql.org/wiki/Bitmap_Indexes#General
Script 4: Select Query with Tree Operator
The final script used the @> operator to fetch specific hierarchical data.
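Again the original SQL is not shown; a sketch of an @> (ancestor-of) query, with 'root.s_16340' as a hypothetical subtree prefix, could be:

```sql
-- Fetch every row whose path lies under the given prefix; this operator
-- is served directly by the GiST index, hence the ~200 ms runtime.
SELECT *
FROM big_tree_test
WHERE 'root.s_16340'::ltree @> permissions
LIMIT 10000;
```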
Performance Metrics:
- Rows Fetched: 10,000
- Execution Time: Approximately 200 milliseconds
- This query's speed showcases the effectiveness of GiST indexes in handling ltree operations.