Summary
Together with our global community of contributors, GreptimeDB continues to evolve and flourish as a growing open-source project. We are grateful to each and every one of you.
Below are the highlights among recent commits:
Log Ingestion Support: process and transform log data using an Elastic Ingest Pipelines-style configuration, then ingest the typed data into tables schemalessly.
Implement SHOW CREATE FLOW: GreptimeFlow continues to mature, now supporting SHOW CREATE FLOW.
Simplify Parquet Writer: removing the redundant writer between Arrow and OpenDAL/S3 cuts the time spent writing Parquet files by 6%.
New Projects
Released Grafana GreptimeDB Plugin
We have released the Grafana GreptimeDB plugin, which is based on the Grafana Prometheus plugin. This plugin provides better interaction and functionality support for GreptimeDB, including support for GreptimeDB's multi-value model. It is currently available for local installation. For more details: https://github.com/GreptimeTeam/greptimedb-grafana-datasource/
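For a rough idea of what a local installation typically involves with Grafana plugins in general (the archive name and plugin path below are placeholders, not taken from the plugin's README), you would unpack the released plugin into Grafana's plugin directory and restart Grafana:

# Hypothetical steps; see the repository README for the authoritative instructions.
unzip greptimedb-grafana-datasource.zip -d /var/lib/grafana/plugins/
sudo systemctl restart grafana-server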
Contributors
Over the past two weeks, our community has been highly active, with a total of 55 PRs merged. 6 PRs from 5 individual contributors were merged successfully, with many more pending.
Congratulations to our most active contributors over the past two weeks!
👏 Welcome @cjwcommuny and @WL2O2O, who join the community as new individual contributors. Congratulations on successfully merging your first PRs, with more on the way.
A big THANK YOU to all our members and contributors! It is people like you who are making GreptimeDB a great product. Let's build an even greater community together.
Highlights of Recent PRs
db#4014 Log Ingestion Support
This PR introduces support for log ingestion. We use Elastic Ingest Pipelines syntax to define processing and transformation behavior, which we call Pipelines. After uploading a Pipeline model to the database, we can use it to process logs into structured data and insert them into tables.
For example, we can create a Pipeline like the following:
curl -X "POST" "http://localhost:4000/v1/events/pipelines/test" \
-H 'Content-Type: application/x-yaml' \
-d 'processors:
- date:
field: time
formats:
- "%Y-%m-%d %H:%M:%S%.3f"
ignore_missing: true
transform:
- fields:
- id1
- id2
type: int32
- fields:
- type
- log
- logger
type: string
- field: time
type: time
index: timestamp
'
It also supports putting the pipeline content into a file and uploading the whole file.
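For instance, assuming the YAML above is saved in a file named pipeline.yaml (a hypothetical file name), the upload could look roughly like this, using curl's --data-binary to send the file contents as the request body:

curl -X "POST" "http://localhost:4000/v1/events/pipelines/test" \
     -H 'Content-Type: application/x-yaml' \
     --data-binary @pipeline.yaml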
A Pipeline named test is now created in the greptime_private.pipelines table. We can then try to ingest some logs into the database:
curl -X "POST" "http://localhost:4000/v1/events/logs?db=public&table=logs1&pipeline_name=test" \
-H 'Content-Type: application/json' \
-d '[
{
"id1": "2436",
"id2": "2528",
"logger": "INTERACT.MANAGER",
"type": "I",
"time": "2024-05-25 20:16:37.217",
"log": "ClusterAdapter:enter sendTextDataToCluster\\n"
}
]'
The log data is JSON formatted. The new /v1/events/logs API looks up the Pipeline named by the pipeline_name parameter and uses it to process the payload. Note how the fields correspond to the Pipeline definition. A table named logs1 is created (if it does not already exist) and the typed data is inserted into it.
mysql> show create table logs1;
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logs1 | CREATE TABLE IF NOT EXISTS `logs1` (
`id1` INT NULL,
`id2` INT NULL,
`type` STRING NULL,
`log` STRING NULL,
`logger` STRING NULL,
`time` TIMESTAMP(9) NOT NULL,
TIME INDEX (`time`)
)
ENGINE=mito
WITH(
append_mode = 'true'
) |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
mysql> select * from logs1;
+------+------+------+---------------------------------------------+------------------+----------------------------+
| id1 | id2 | type | log | logger | time |
+------+------+------+---------------------------------------------+------------------+----------------------------+
| 2436 | 2528 | I | ClusterAdapter:enter sendTextDataToCluster
| INTERACT.MANAGER | 2024-05-25 20:16:37.217000 |
+------+------+------+---------------------------------------------+------------------+----------------------------+
1 row in set (0.03 sec)
db#4112 Simplify Parquet Writer
The BufferedWriter was introduced to bridge the gap between Arrow's Parquet writer, which requires std::io::Write, and OpenDAL, which only provides an async S3 writer implementing tokio::io::AsyncWrite. Now that Arrow provides AsyncArrowWriter, those structs can be removed.
By removing these redundant structures and extra code paths, we achieve a 6% reduction in the time spent writing Parquet files.
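For illustration, here is a rough, self-contained sketch (not GreptimeDB's actual code) of writing a RecordBatch through AsyncArrowWriter directly into an async sink; exact constructor signatures and required crate features vary between parquet versions:

// Rough Cargo deps: arrow, parquet (features "arrow" + "async"), tokio (features "full").
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::AsyncArrowWriter;
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
    )?;

    // The writer accepts the async sink directly (a local file here; an
    // OpenDAL/S3 writer in GreptimeDB's case), so no blocking
    // std::io::Write adapter sits in between.
    let file = File::create("example.parquet").await?;
    let mut writer = AsyncArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch).await?;
    writer.close().await?;
    Ok(())
}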
db#4040 Implement SHOW CREATE FLOW
Now we can SHOW CREATE FLOW after a Flow is created. For example, if we create a Flow like the following:
mysql> CREATE FLOW IF NOT EXISTS my_flow
-> SINK TO my_sink_table
-> EXPIRE AFTER INTERVAL '1 hour'
-> AS
-> SELECT count(1) from monitor;
Query OK, 0 rows affected (0.04 sec)
We can use SHOW CREATE FLOW my_flow to check the CREATE statement later on:
mysql> show create flow my_flow;
+---------+-----------------------------------------------------------------------------------------------------------------------+
| Flow | Create Flow |
+---------+-----------------------------------------------------------------------------------------------------------------------+
| my_flow | CREATE OR REPLACE FLOW IF NOT EXISTS my_flow
SINK TO my_sink_table
EXPIRE AFTER 3600
AS SELECT count(1) FROM monitor |
+---------+-----------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
db#4151 Guide For Benchmarking GreptimeDB
We use a modified version of tsbs to benchmark GreptimeDB. Now we provide a guide so that users can run the benchmark themselves. It can also be used to benchmark any other database that tsbs supports, so that a comparison can be generated. Please note that GreptimeDB should be built in release mode before running the benchmark.
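As a rough sketch of that release-build step (the standalone start command is the one from GreptimeDB's getting-started docs; the actual tsbs invocation and data-generation flags are covered in the guide itself):

# Build an optimized binary and start it before pointing tsbs at it.
cargo build --release
./target/release/greptime standalone start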
Please refer to the benchmark guide here.
Good First Issue
db#4157 Fix information_schema.region_peers returning the same region_id
We store region_id in information_schema.region_peers. However, it seems only one region_id is returned even when there are multiple region peers. Find out whether there is a bug in assembling the return value of information_schema.region_peers and fix it (a hypothetical query to observe this is sketched at the end of this section).
Keywords: Information Schema
Difficulty: Easy
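For reference, a hypothetical query to observe the behavior (the column names are assumed from the information_schema documentation, not taken from the issue):

-- With a table whose data spans several regions, each row here is expected
-- to carry a distinct region_id rather than repeating the same one.
SELECT region_id, peer_id, is_leader FROM information_schema.region_peers;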