Tutorial • May 26, 2023

GreptimeDB User Guide - How to Query and Analyze External Data

To help new users quickly master GreptimeDB and start getting value from real-time data, we have prepared a series of articles on how to use it. Organized by use case, these articles provide detailed installation and usage steps so you can apply the product effectively in your own environment. In this article, we elaborate on how to import and query external data with GreptimeDB.

What is GreptimeDB

GreptimeDB is a powerful time-series database written in Rust, making it safe, fast, and efficient for real-time writing, reading, and analysis of time-series data.


Together, let's explore the infinite possibilities of GreptimeDB!

How to use GreptimeDB to analyze external data

The true value of data is only realized when it is analyzed. GreptimeDB enhances data analysis capabilities by supporting both SQL and Python scripts.

Let's say you have a data file in CSV format and you want to leverage the analytical power of GreptimeDB. GreptimeDB provides two ways to rapidly query and analyze external data, and in the following sections we elaborate on each of them.

To simplify our explanation, let's imagine a hypothetical scenario. This scenario draws from a fragment of a publicly available Kaggle dataset:

You are collaborating with the government to create a digital and intelligent city that improves the efficiency of public services. One of the challenges is traffic congestion. As a data scientist, your goal is to better manage urban traffic and provide advice on future infrastructure planning.

The government's objective is to establish a robust transportation system that can adeptly handle peak traffic periods. To achieve this, they've provided you with a comprehensive set of historical data outlining traffic patterns at various city intersections. Your assignment is to leverage this data to develop a model that enables effective analysis and prediction of traffic trends.

Prepare

Environment

If you do not have GreptimeDB installed in your environment, the following steps will guide you through quickly installing and starting it. If you have already installed and started GreptimeDB locally, you can skip ahead to the Data section.

1. Install GreptimeDB (using Docker as an example)

Open a terminal on your computer and enter the following command. (The same applies to the commands below.)

```bash
docker pull greptime/greptimedb:latest
```

For more installation methods, navigate to the download page or Getting Started docs to learn more.

2. Start GreptimeDB

```bash
docker run -p 4000-4004:4000-4004 \
  -p 4242:4242 -v "$(pwd)/greptimedb:/tmp/greptimedb" \
  --name greptime --rm greptime/greptimedb standalone start \
  --http-addr 0.0.0.0:4000 \
  --rpc-addr 0.0.0.0:4001 \
  --mysql-addr 0.0.0.0:4002 \
  --postgres-addr 0.0.0.0:4003 \
  --prom-addr 0.0.0.0:4004 \
  --opentsdb-addr 0.0.0.0:4242
```
3. Connect to GreptimeDB

Use the following command to connect to GreptimeDB with the MySQL client (or another compatible client) and verify that the service is working properly.

```bash
mysql -h 127.0.0.1 -P 4002

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 5.1.10-alpha-msql-proxy Greptime

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show tables;
+---------+
| Tables  |
+---------+
| numbers |
| scripts |
+---------+
2 rows in set (0.01 sec)
```

Data

GreptimeDB supports importing data in CSV, JSON, and Parquet formats. We have downloaded an IoT dataset from the publicly available Kaggle dataset that describes the traffic conditions at four junctions in a city.

The following shows a part of the data.

```csv
DateTime,Junction,Vehicles,ID
2015-11-01 00:00:00,1,15,20151101001
2015-11-01 01:00:00,1,13,20151101011
2015-11-01 02:00:00,1,10,20151101021
2015-11-01 03:00:00,1,7,20151101031
2015-11-01 04:00:00,1,9,20151101041
2015-11-01 05:00:00,1,6,20151101051
2015-11-01 06:00:00,1,9,20151101061
2015-11-01 07:00:00,1,8,20151101071
2015-11-01 08:00:00,1,11,20151101081
```

While preparing your data, make sure the field names in the header do not collide with keywords reserved by GreptimeDB. For instance, DateTime in the data table is a reserved keyword, so such names must be changed before importing the data. As a suggestion, DateTime could be renamed to TS.
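If you'd rather script the rename than edit the file by hand, here is a minimal sketch in Python. The file names follow this tutorial; the `rename_header` helper is hypothetical, not part of GreptimeDB:

```python
import csv

def rename_header(src_path, dst_path, renames):
    """Copy a CSV file, renaming header fields per the `renames` mapping."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        # Replace reserved names (e.g. DateTime -> TS); leave other fields untouched.
        writer.writerow([renames.get(name, name) for name in header])
        # Data rows are copied through unchanged.
        writer.writerows(reader)

# Example usage:
# rename_header("train_ML_IOT.csv", "train_ML_IOT_fixed.csv", {"DateTime": "TS"})
```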

In this tutorial, we started GreptimeDB through Docker and mounted a local directory with -v "$(pwd)/greptimedb:/tmp/greptimedb". As a result, we need to copy the prepared data file into that directory.

```bash
cp /<path_of_data_folder>/train_ML_IOT.csv "$(pwd)/greptimedb/"
```

Query and analyze external data

Query data through an external table

GreptimeDB facilitates the process of querying and analyzing external data directly, simplifying the procedure for users.

1. Create an external table

```sql
mysql> CREATE EXTERNAL TABLE train_ML_IOT_sample WITH (LOCATION='/tmp/greptimedb/train_ML_IOT.csv', FORMAT='csv');
Query OK, 0 rows affected (0.08 sec)
```
2. Query and analyze the data through SQL/Python

```sql
mysql> SELECT * FROM "train_ML_IOT_sample" WHERE "Junction" = 1 LIMIT 10;
+---------------------+----------+----------+-------------+
| TS                  | Junction | Vehicles | ID          |
+---------------------+----------+----------+-------------+
| 2015-11-01 00:00:00 |        1 |       15 | 20151101001 |
| 2015-11-01 01:00:00 |        1 |       13 | 20151101011 |
| 2015-11-01 02:00:00 |        1 |       10 | 20151101021 |
| 2015-11-01 03:00:00 |        1 |        7 | 20151101031 |
| 2015-11-01 04:00:00 |        1 |        9 | 20151101041 |
| 2015-11-01 05:00:00 |        1 |        6 | 20151101051 |
| 2015-11-01 06:00:00 |        1 |        9 | 20151101061 |
| 2015-11-01 07:00:00 |        1 |        8 | 20151101071 |
| 2015-11-01 08:00:00 |        1 |       11 | 20151101081 |
| 2015-11-01 09:00:00 |        1 |       12 | 20151101091 |
+---------------------+----------+----------+-------------+
10 rows in set (0.02 sec)
```
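If you want to sanity-check such a query against the raw file without a database, the same filter-and-limit logic is easy to express in plain Python. This is only an illustrative sketch, assuming the CSV header has already been renamed to use TS; the `query_csv` helper is hypothetical:

```python
import csv
from itertools import islice

def query_csv(path, junction, limit=10):
    """Return up to `limit` rows whose Junction column matches, in file order.

    Roughly mirrors:
      SELECT * FROM "train_ML_IOT_sample" WHERE "Junction" = <junction> LIMIT <limit>;
    """
    with open(path, newline="") as f:
        rows = csv.DictReader(f)
        matching = (r for r in rows if int(r["Junction"]) == junction)
        return list(islice(matching, limit))
```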

In this approach, data isn't actually imported into GreptimeDB, yet the analytical capabilities of GreptimeDB can still be leveraged. In the following section, we demonstrate a recommended method to do so.

Query data by importing external data

When dealing with substantial data volumes, particularly over 100 MB, querying external tables directly may perform poorly. In such cases, it is advisable to import the data into GreptimeDB for analysis.

1. Create a table

We need to create a table in advance for the data to be imported, specifying the fields, their types, and the indexes used for queries. Suppose we may need to query the traffic situation at different junctions over time; then we need indexes on TS and Junction. In a time-series database, the TS field typically serves as the time index, while Junction is a natural primary key.

Tip: when creating the table, you can quickly obtain the schema by first creating an external table, as follows:

```sql
mysql> CREATE EXTERNAL TABLE train_ML_IOT_sample WITH (LOCATION='/tmp/greptimedb/train_ML_IOT.csv', FORMAT='csv');
Query OK, 0 rows affected (0.08 sec)

mysql> SHOW CREATE TABLE train_ML_IOT_sample;
+---------------------+------------------------------------------------------------+
| Table               | Create Table                                               |
+---------------------+------------------------------------------------------------+
| train_ML_IOT_sample | CREATE EXTERNAL TABLE IF NOT EXISTS train_ML_IOT_sample (
  TS TIMESTAMP(0) NULL,
  Junction BIGINT NULL,
  Vehicles BIGINT NULL,
  ID BIGINT NULL,
)
ENGINE=file
WITH(
  format = 'csv',
  location = '/tmp/greptimedb/train_ML_IOT.csv'
) |
+---------------------+------------------------------------------------------------+
```

From the output above, we can see the field types recognized by GreptimeDB. We can copy the field names and types, add the indexes, and execute the following statement to create an internal table:

```sql
CREATE TABLE IF NOT EXISTS train_ML_IOT (
  TS TIMESTAMP(0) NULL,
  Junction BIGINT NULL,
  Vehicles BIGINT NULL,
  ID BIGINT NULL,
  TIME INDEX (TS),
  PRIMARY KEY (Junction)
);
```
2. Import data

GreptimeDB supports the COPY statement for importing and exporting data. For more details, please refer to https://docs.greptime.com/reference/sql/copy.

```sql
mysql> COPY train_ML_IOT FROM '/tmp/greptimedb/train_ML_IOT.csv' WITH (FORMAT='csv');
Query OK, 48120 rows affected (0.11 sec)
```
3. Query and analyze the data through SQL/Python

```sql
mysql> SELECT * FROM "train_ML_IOT" WHERE "Junction" = 1 LIMIT 10;
+---------------------+----------+----------+-------------+
| TS                  | Junction | Vehicles | ID          |
+---------------------+----------+----------+-------------+
| 2017-06-20 08:00:00 |        1 |       60 | 20170620081 |
| 2017-06-20 09:00:00 |        1 |       76 | 20170620091 |
| 2017-06-20 10:00:00 |        1 |       92 | 20170620101 |
| 2017-06-20 11:00:00 |        1 |      102 | 20170620111 |
| 2017-06-20 12:00:00 |        1 |      113 | 20170620121 |
| 2017-06-20 13:00:00 |        1 |      100 | 20170620131 |
| 2017-06-20 14:00:00 |        1 |      107 | 20170620141 |
| 2017-06-20 15:00:00 |        1 |      110 | 20170620151 |
| 2017-06-20 16:00:00 |        1 |      101 | 20170620161 |
| 2017-06-20 17:00:00 |        1 |      108 | 20170620171 |
+---------------------+----------+----------+-------------+
10 rows in set (0.03 sec)
```

Conclusion

The two methods above allow for swift querying and analysis of external data via GreptimeDB. By harnessing machine learning libraries, we can also carry out predictive analysis: since GreptimeDB supports Python, users can employ libraries from the Python ecosystem for data modeling. For Python users, see this document to learn more about data analysis with Python.
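As a taste of the kind of modeling this enables, here is a minimal least-squares trend fit in plain Python. This is only a sketch: a real project would pull vehicle counts from GreptimeDB and use NumPy or scikit-learn, and both helper functions here are hypothetical:

```python
def fit_linear_trend(counts):
    """Ordinary least squares fit y = a*t + b over t = 0..n-1; returns (a, b)."""
    n = len(counts)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(counts) / n
    # Slope = covariance(t, y) / variance(t); intercept from the means.
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, counts))
    var = sum((t - mean_t) ** 2 for t in ts)
    a = cov / var
    b = mean_y - a * mean_t
    return a, b

def predict(counts, steps_ahead):
    """Extrapolate the fitted trend `steps_ahead` steps past the last sample."""
    a, b = fit_linear_trend(counts)
    return a * (len(counts) - 1 + steps_ahead) + b

# Example usage with made-up hourly vehicle counts:
# predict([6, 8, 10, 12], 1)  # extrapolates the linear trend one hour ahead
```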

GreptimeDB is iterating rapidly, and we are committed to providing more informative tutorials and best-practice guidelines. Should you have any questions or encounter any issues, join our Slack community and reach out for assistance.
