Engineering
May 21, 2024

Upgraded Internal Transaction Framework Design for Enhanced Fault Tolerance


Single-node databases typically use Write-Ahead Logging (WAL) to ensure the atomicity, consistency, and durability of operations. Distributed databases, on the other hand, face more complex scenarios because state must be coordinated across multiple machines. When designing distributed systems, it's common to assume that every module is unreliable, a principle known as the "fault-tolerance assumption": modules may crash, lag, or lose communication with one another, so the system must be designed to handle such unreliability.

To support internal transactions in a distributed database, GreptimeDB designed and implemented the Procedure framework to ensure the atomicity, consistency, and durability of modifications to multiple pieces of state data in the database. Our previous article, Procedure Framework - How GreptimeDB Improves the Fault Tolerance Capability, introduced the motivation and implementation of the Procedure framework and outlined some plans for future work. This article builds on that work and shares GreptimeDB's recent progress on internal distributed transactions.

It's worth noting that similar internal transaction frameworks exist in most distributed systems. For example, Apache HBase has the Procedure V2 framework, and Apache Accumulo implements the Fault-Tolerant Executor (FATE), a distributed multi-step operation framework. For business applications, there is also Temporal, a durable execution framework built on the same design principles, which backs a unicorn company valued at $1.5 billion. We believe the technology and experience shared in this article will be very helpful for developers looking to explore this area of data processing. 😃

Rollback-capable Procedure

In PR-3625, we introduced a rollback mechanism for Procedures, enabling the system to roll back when encountering non-retryable errors.

Take the rollback-enabled Drop Table DDL (PR-3692) as an example. Initially, the Drop Table DDL consisted of the following four steps:

  1. Preparation
  2. Physical deletion of metadata
  3. Notification of Region deletion to the datanode
  4. Cache deletion

In this process, if a non-retryable error occurs during step 3, the entire procedure cannot be rolled back because the metadata has already been deleted. At that point, the Regions left without metadata become orphaned data fragments, which might lead to data leakage or other, more serious issues.

To ensure atomicity when this operation fails, we refactored the steps of the Drop Table Procedure to support rollback. The rollback-capable Drop Table Procedure now includes the following five steps:

  1. Preparation
  2. Logical deletion of metadata
  3. Notification of Region deletion to the datanode
  4. Cache deletion
  5. Physical deletion of metadata
Figure 1: Drop Table Procedure State Machine Diagram

The yellow path in Figure 1 represents the rollback steps. If a non-retryable error occurs in step 3, the procedure can roll back along the yellow path to preserve data integrity.
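Conceptually, a rollback-capable Procedure pairs each completed step with a compensating action that is replayed in reverse order once a non-retryable error is hit. Below is a minimal sketch of that idea in Rust; the `Step` and `Procedure` types are hypothetical simplifications for illustration, not GreptimeDB's actual Procedure API.

```rust
// Minimal sketch of a rollback-capable procedure: every step can execute
// itself and undo itself. Names here are hypothetical, not GreptimeDB's API.

#[derive(Debug)]
enum StepError {
    Retryable(String),
    NonRetryable(String),
}

trait Step {
    fn execute(&mut self) -> Result<(), StepError>;
    fn rollback(&mut self);
}

struct Procedure {
    steps: Vec<Box<dyn Step>>,
}

impl Procedure {
    fn run(&mut self) -> Result<(), StepError> {
        for i in 0..self.steps.len() {
            if let Err(err) = self.steps[i].execute() {
                if matches!(err, StepError::NonRetryable(_)) {
                    // Undo the steps that already completed, in reverse order
                    // (the "yellow path" in the state machine diagram).
                    for j in (0..i).rev() {
                        self.steps[j].rollback();
                    }
                }
                // A retryable error would normally be retried with backoff;
                // here both kinds are simply surfaced to the caller.
                return Err(err);
            }
        }
        Ok(())
    }
}
```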

Fine-grained Locks in Procedures

Why Introduce Fine-grained Locks

Before introducing fine-grained locks (PR-3061), Procedures only supported locking resources at the table level, using mutex locks.

However, this method might lead to the following issues:

  1. Drop Database required locking all table resources under the database;
  2. During the execution of Drop Database, it was also possible to create tables in the database being deleted;
  3. When operating on resources smaller than a table, Procedures could only be executed serially.

For instance, Region Migrations of different Regions within the same table could not be executed in parallel, because each Region Migration Procedure had to acquire a lock at the table level rather than at the more granular Region level.

At this point, astute readers might be eager to chime in: "Have you considered using Hierarchical Locks, a staple of university database exams?"

As you might suspect, the implementation of fine-grained locks in Procedures is heavily inspired by the design principles of hierarchical locks. At the outset, we researched implementations of hierarchical locks and dug up the landmark 1975 paper, Granularity of Locks and Degrees of Consistency in a Shared Data Base.

This paper on Hierarchical Locks is a milestone in the database field, authored by the IBM System R team. They published a series of articles on database topics during the same period, such as The Notions of Consistency and Predicate Locks in a Database System, and these articles are still widely cited today. It's also worth mentioning that Predicate Locks haven't been entirely phased out; variants of them appear in libraries like Rust's Moka. These works laid the foundation for many concepts in today's relational databases, such as transactions and consistency.

Compatibility of Hierarchical Locks

The core design goal of Hierarchical Locks in the original paper was to provide a way to implicitly lock an entire subtree. Besides implicit subtree locking, a subtree can also be locked explicitly, by obtaining locks step by step from the leaf nodes up to the root.

Design of Fine-grained Locks in GreptimeDB

Inspired by the approach of explicitly locking subtrees, we decided to integrate these two methods of subtree locking in designing fine-grained locks. This means we allow both implicit locking of the entire subtree and explicit locking of subtrees. By adopting this approach, we can manage concurrent Procedure executions more flexibly, thereby enhancing the performance and efficiency of the database.

Additionally, in the original paper, Hierarchical Locks offered multiple locking modes to provide more flexible concurrency control. However, these features are not necessary in our scenario. We have discarded these complex modes and retained only the shared mode and exclusive mode to simplify lock management.
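To make this concrete, the snippet below sketches what a lock key hierarchy with only two modes might look like, together with the usual compatibility rule (shared locks are mutually compatible; exclusive locks conflict with everything). The `LockKey` and `LockMode` types here are illustrative assumptions, not GreptimeDB's actual lock manager; the examples in the next section reuse these illustrative types.

```rust
// Illustrative two-mode lock model, as described above. These types are a
// simplified sketch and do not mirror GreptimeDB's actual lock manager.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LockMode {
    Shared,    // compatible with other shared holders
    Exclusive, // compatible with nothing
}

/// A lock key identifies one node in the resource hierarchy:
/// catalog -> schema (database) -> table -> region.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
enum LockKey {
    Catalog(String),
    Schema(String, String),
    Table(String, String, String),
    Region(u64),
}

/// Two requests on the same key conflict unless both are shared.
fn conflicts(a: LockMode, b: LockMode) -> bool {
    !(a == LockMode::Shared && b == LockMode::Shared)
}

fn main() {
    assert!(!conflicts(LockMode::Shared, LockMode::Shared));
    assert!(conflicts(LockMode::Shared, LockMode::Exclusive));
    assert!(conflicts(LockMode::Exclusive, LockMode::Exclusive));
}
```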

Workflow of Procedures with Fine-grained Locks

Here are a few examples illustrating how fine-grained locks in Procedures help us solve the aforementioned issues:

Hierarchical Relationship Diagram of Resources

DDL Procedure

For Procedures that modify databases, let's consider an operation aimed at deleting the greptime.foo database. First, we need to acquire an exclusive lock on the database resource named greptime.foo. This operation actually allows us to implicitly lock the entire subtree of resources under the greptime.foo database, thereby addressing the following two issues:

  1. We don't need to explicitly acquire locks for the entire subtree; we can implicitly lock all resources under this database.
Implicit Locking of the Entire Subtree Resources under `greptime.foo` Database
  2. When executing Procedures for creating or deleting databases, corresponding DDL Procedures for tables under the database and Region Migration Procedures (ISSUE-2700) will not execute because they cannot obtain shared locks on the greptime.foo database resource.
Illustration of Mutex Relationships Between Procedures
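Reusing the illustrative `LockKey`/`LockMode` types from the earlier sketch, the lock list a Drop Database Procedure would need could look roughly like this (a hypothetical example, not GreptimeDB's actual lock keys):

```rust
// Hypothetical lock list for dropping the `greptime.foo` database: a shared
// lock on the catalog plus an exclusive lock on the schema node implicitly
// covers every table and region underneath, so no per-table locks are needed.
fn drop_database_lock_keys() -> Vec<(LockKey, LockMode)> {
    vec![
        (LockKey::Catalog("greptime".into()), LockMode::Shared),
        (
            LockKey::Schema("greptime".into(), "foo".into()),
            LockMode::Exclusive,
        ),
    ]
}
```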

Region Migration Procedure

For the Region Migration Procedure, we explicitly lock the subtree, obtaining locks step by step from the leaf nodes up to the root nodes, which fulfills our requirements:

  1. Region Migration Procedures for different Regions of the same table can be executed in parallel.
Region Migration Procedures in parallel for different Regions of the same table
  2. When executing the Region Migration Procedure, DDL Procedures for the same table will not be executed because they cannot obtain an exclusive lock on the greptime.foo.baz table resource.
Illustration of Mutex Relationships Between Procedures
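Again using the illustrative types from above, the explicit lock list for migrating a single Region of the `greptime.foo.baz` table might look like this (hypothetical names and region IDs):

```rust
// Hypothetical lock list for migrating one region of `greptime.foo.baz`:
// shared locks on the ancestor nodes and an exclusive lock only on the region
// being moved, so migrations of sibling regions can run in parallel while
// table-level DDL on `baz` is blocked.
fn region_migration_lock_keys(region_id: u64) -> Vec<(LockKey, LockMode)> {
    vec![
        (LockKey::Catalog("greptime".into()), LockMode::Shared),
        (
            LockKey::Schema("greptime".into(), "foo".into()),
            LockMode::Shared,
        ),
        (
            LockKey::Table("greptime".into(), "foo".into(), "baz".into()),
            LockMode::Shared,
        ),
        (LockKey::Region(region_id), LockMode::Exclusive),
    ]
}
```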

Implementation of Drop Database

With the support of fine-grained locks, we were able to implement the Drop Database Procedure (ISSUE-3516) in a more elegant manner.

The state machine of the Drop Database Procedure is illustrated in the diagram below:

State Machine Diagram for Drop Database Procedure

In essence, the Drop Database Procedure can be divided into the following steps:

  1. Acquire an exclusive lock on the database being deleted.
  2. The cursor is responsible for loading the metadata of the next table to be deleted.
  3. The executor is responsible for deleting the corresponding metadata of the table, deleting the corresponding regions, and clearing the cache.
  4. Repeat steps 2 and 3 until all tables under the database are deleted.
  5. Finally, delete the metadata of the database and clear the cache.
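As a rough sketch of this control flow, the loop driven by the cursor and executor could be expressed as follows; the trait and method names are hypothetical and only illustrate the steps above, not GreptimeDB's actual implementation.

```rust
// Hypothetical control flow for the Drop Database procedure described above:
// a cursor yields one table's metadata at a time and an executor tears that
// table down, until the database itself can be removed.

struct TableMetadata {
    name: String,
}

trait MetadataCursor {
    /// Loads the metadata of the next table to delete, or None when done.
    fn next_table(&mut self) -> Option<TableMetadata>;
}

trait DropTableExecutor {
    fn delete_table_metadata(&mut self, table: &TableMetadata);
    fn delete_regions(&mut self, table: &TableMetadata);
    fn invalidate_cache(&mut self, table: &TableMetadata);
}

fn drop_database(cursor: &mut dyn MetadataCursor, executor: &mut dyn DropTableExecutor) {
    // Step 1 (taking the exclusive database lock) happens before this loop.
    while let Some(table) = cursor.next_table() {
        executor.delete_table_metadata(&table);
        executor.delete_regions(&table);
        executor.invalidate_cache(&table);
    }
    // Finally, the database's own metadata and cache entries are removed
    // (omitted here).
}
```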

Future Work on Procedures

Currently, GreptimeDB's Procedure framework supports rollback and fine-grained locks, which means all the basic functionality required by GreptimeDB Procedures is now in place. Going forward, we will focus on adding more fuzz testing and chaos testing to the Procedure framework to further harden and optimize it.

Additionally, many DDL operations have not been implemented yet. Implementing them will further exercise the Procedure framework and may surface new requirements. Recent DDL support work includes:

  • Support for changing the type of columns in a table (PR-3796)
  • Support for creating views (PR-3807)

About Greptime

We help industries that generate large amounts of time-series data, such as Connected Vehicles (CV), IoT, and Observability, to efficiently uncover the hidden value of data in real-time.

Visit the latest version from any device to get started and get the most out of your data.

  • GreptimeDB, written in Rust, is a distributed, open-source, time-series database designed for scalability, efficiency, and powerful analytics.
  • GreptimeCloud is a fully-managed cloud database-as-a-service (DBaaS) solution built on GreptimeDB. It efficiently supports applications in fields such as observability, IoT, and finance. The built-in observability solution, GreptimeAI, helps users comprehensively monitor the cost, performance, traffic, and security of LLM applications.
  • Vehicle-Cloud Integrated TSDB solution is tailored for business scenarios of automotive enterprises. It addresses the practical business pain points that arise when enterprise vehicle data grows exponentially.

If anything above draws your attention, don't hesitate to star us on GitHub or join GreptimeDB Community on Slack. Also, you can go to our contribution page to find some interesting issues to start with.

