Three Systems, One Problem to Debug
Once an LLM application hits production, engineers start asking questions like these:
- Why is this request so slow — is the model taking too long, or is it the application logic?
- Token usage tripled last week. Which user? Which prompt?
- A user reported "the AI gave a terrible answer." Can we find that conversation?
- Same scenario, different model — how do p95 latency and cost actually compare?
None of these questions are unusual. The problem is that the data usually lives in different systems: traces in Jaeger, token metrics in Prometheus, and conversation content in Elasticsearch. A single LLM call's data is split across three backends, and answering one question means bouncing between three UIs.
The hard part of LLM observability isn't collecting data — it's that your traces, metrics, and conversations live in three databases that don't talk to each other.
This post walks through a demo built on the OpenTelemetry GenAI Semantic Conventions, showing how GreptimeDB works as a unified backend for all three.
OTel GenAI Semantic Conventions: A Common Vocabulary for LLM Telemetry
OpenTelemetry (OTel) has become the de facto standard for observability in distributed systems. Since 2024, the OTel community has been extending it to cover GenAI workloads — defining a standardized gen_ai.* attribute schema so that LLM telemetry looks the same regardless of which SDK or provider you're using.
Three Signal Types
The spec covers three signal types:
Traces — each LLM call produces a Span with structured attributes:
| Attribute | Meaning | Example |
|---|---|---|
| `gen_ai.system` | Provider | `openai` |
| `gen_ai.request.model` | Requested model | `gpt-4o-mini` |
| `gen_ai.response.model` | Actual model used | `gpt-4o-mini-2024-07-18` |
| `gen_ai.usage.input_tokens` | Input token count | `142` |
| `gen_ai.usage.output_tokens` | Output token count | `87` |
| `gen_ai.response.finish_reasons` | Stop reason (JSON array) | `["stop"]`, `["tool_calls"]` |
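To make the schema concrete, here is a hypothetical span's `gen_ai.*` attributes as a plain dict, with two small helpers that derive the total token count and decode the stop reason. The values are illustrative, not taken from the demo:

```python
import json

# Illustrative gen_ai.* attributes for a single hypothetical LLM-call span
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o-mini",
    "gen_ai.response.model": "gpt-4o-mini-2024-07-18",
    "gen_ai.usage.input_tokens": 142,
    "gen_ai.usage.output_tokens": 87,
    "gen_ai.response.finish_reasons": '["stop"]',  # JSON array, per the spec
}

def total_tokens(attrs: dict) -> int:
    """Sum input and output token counts from a span's attributes."""
    return attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]

def finish_reasons(attrs: dict) -> list[str]:
    """Decode the JSON-array finish_reasons attribute."""
    return json.loads(attrs["gen_ai.response.finish_reasons"])

print(total_tokens(span_attributes))    # 229
print(finish_reasons(span_attributes))  # ['stop']
```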
Metrics — two histogram instruments, emitted automatically by compliant SDKs:
- `gen_ai.client.operation.duration`: end-to-end latency per call (unit: `s`)
- `gen_ai.client.token.usage`: token consumption per call (unit: `{token}`)
These land in GreptimeDB under different table names because of how Prometheus-compatible naming handles units:
- `gen_ai.client.operation.duration` has a time unit (`s`), so the `_seconds` suffix gets appended. Tables: `gen_ai_client_operation_duration_seconds_bucket` / `_count` / `_sum`.
- `gen_ai.client.token.usage` uses a dimensionless unit (`{token}`), which gets stripped. Tables: `gen_ai_client_token_usage_bucket` / `_count` / `_sum`.
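The suffix behavior is easy to illustrate with a toy function. This is a simplified sketch of the naming convention described above, not GreptimeDB's actual implementation, and the unit mapping is intentionally partial:

```python
def metric_table_base(otel_name: str, unit: str) -> str:
    """Approximate the Prometheus-style table name for an OTel metric:
    dots become underscores; known time units append a suffix;
    dimensionless units like '{token}' are dropped entirely."""
    base = otel_name.replace(".", "_")
    unit_suffixes = {"s": "_seconds", "ms": "_milliseconds"}  # partial mapping
    if unit in unit_suffixes:
        base += unit_suffixes[unit]
    return base

print(metric_table_base("gen_ai.client.operation.duration", "s"))
# gen_ai_client_operation_duration_seconds
print(metric_table_base("gen_ai.client.token.usage", "{token}"))
# gen_ai_client_token_usage
```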
Logs/Events — opt-in capture of full conversation content (prompts and completions). Off by default to avoid capturing sensitive data; enable with OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true.
One Line of Code to Instrument
```python
from openai import OpenAI
from opentelemetry.instrumentation.openai_v2 import OpenAIInstrumentor

OpenAIInstrumentor().instrument()

# Every subsequent OpenAI SDK call now emits gen_ai.* telemetry
client = OpenAI()
client.chat.completions.create(model="gpt-4o-mini", messages=[...])
```

`opentelemetry-instrumentation-openai-v2` monkey-patches the OpenAI client, so there are no changes to your application code. Spans and metrics are on by default; log capture is opt-in via the env var above. The spec also covers Anthropic, Azure AI Inference, and AWS Bedrock.
Architecture: OTLP Directly into GreptimeDB
```
GenAI Application (Python)
 │  OpenAI SDK + OTel GenAI Instrumentor
 │  Emits traces + metrics + logs
 │
 │  OTLP/HTTP
 ▼
GreptimeDB
 ├── opentelemetry_traces  ← span attributes flattened into queryable columns
 ├── genai_conversations   ← full prompt/completion content
 ├── OTel metrics          ← histograms, PromQL-compatible
 └── Flow aggregation      ← pre-aggregated token/latency/status rollups
 │
 ▼
Grafana
 ├── SQL panels (traces, logs, Flow tables)
 ├── PromQL panels (OTel histogram metrics)
 └── Trace waterfall (GreptimeDB Grafana plugin)
```

Three endpoints handle the three signal types (see the GreptimeDB OpenTelemetry docs):
```python
# Traces: built-in pipeline flattens span attributes into queryable columns
OTLPSpanExporter(
    endpoint="http://greptimedb:4000/v1/otlp/v1/traces",
    headers={"x-greptime-pipeline-name": "greptime_trace_v1"},
)

# Metrics: standard OTLP; histograms automatically support PromQL
OTLPMetricExporter(endpoint="http://greptimedb:4000/v1/otlp/v1/metrics")

# Logs: header routes data to a specific table
OTLPLogExporter(
    endpoint="http://greptimedb:4000/v1/otlp/v1/logs",
    headers={"X-Greptime-Log-Table-Name": "genai_conversations"},
)
```

No OTel Collector: data flows straight from the app to GreptimeDB. That's fine for demos and small deployments. In production you'd typically add a Collector for buffering, sampling, and PII scrubbing, but the structure stays the same.
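For completeness, here is roughly how exporters like these get wired into the OTel SDK providers in application setup code. This is a sketch based on the standard `opentelemetry-sdk` and OTLP/HTTP exporter APIs, not a copy of the demo's setup (logs wiring, via `LoggerProvider`, follows the same pattern and is omitted for brevity):

```python
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Traces: batch spans and ship them to GreptimeDB's OTLP trace endpoint
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint="http://greptimedb:4000/v1/otlp/v1/traces",
    headers={"x-greptime-pipeline-name": "greptime_trace_v1"},
)))
trace.set_tracer_provider(tracer_provider)

# Metrics: export on a periodic interval to the OTLP metrics endpoint
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://greptimedb:4000/v1/otlp/v1/metrics")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
```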
Here's what the Grafana dashboard looks like once it's running — request volume, token usage, estimated cost, and error rate, then trends and model comparisons below:

What Changes When the Data Lives Together
1. Cross-Signal Joins
With three separate backends, correlating a trace with its conversation log means manually copying a trace_id between tools — you can't express it as a single query.
Since GreptimeDB stores all three signal types, you can join traces and conversations in one SQL query. No tab-switching.
Both opentelemetry_traces and genai_conversations have trace_id and span_id columns. One thing to know: column names in opentelemetry_traces contain dots (span_attributes.gen_ai.request.model), which require double quotes in SQL. All examples below do this.
```sql
-- Find the highest-token prompts and see what users actually sent
SELECT
  t.trace_id,
  t."span_attributes.gen_ai.request.model" AS model,
  t."span_attributes.gen_ai.usage.input_tokens" AS input_tokens,
  t."span_attributes.gen_ai.usage.output_tokens" AS output_tokens,
  json_get_string(parse_json(c.body), 'content') AS user_message
FROM opentelemetry_traces t
JOIN genai_conversations c ON t.trace_id = c.trace_id AND t.span_id = c.span_id
WHERE t."span_attributes.gen_ai.system" IS NOT NULL
  AND json_get_string(parse_json(c.body), 'message.role') IS NULL
ORDER BY input_tokens DESC
LIMIT 10;
```

With a split Prometheus + Elasticsearch setup, you'd usually need application-side correlation or an additional analytics layer, because the data isn't queryable through a shared engine.
2. Derive Metrics from Spans — No Double-Writing
GreptimeDB's Flow engine does continuous aggregation, similar to a materialized view that updates as new rows arrive.
Each LLM call Span is already a wide event with everything you need: model name, token counts, duration, status. Flow lets you derive rollups directly from traces — no need to emit a separate metrics stream from the application.
```sql
-- Per-model token consumption, aggregated per minute
CREATE FLOW genai_token_usage_flow
SINK TO genai_token_usage_1m
EXPIRE AFTER '24h'
AS
SELECT
  "span_attributes.gen_ai.request.model" AS model,
  COUNT("span_attributes.gen_ai.request.model") AS request_count,
  SUM(CAST("span_attributes.gen_ai.usage.input_tokens" AS DOUBLE)) AS total_input_tokens,
  SUM(CAST("span_attributes.gen_ai.usage.output_tokens" AS DOUBLE)) AS total_output_tokens,
  date_bin('1 minute'::INTERVAL, "timestamp") AS time_window
FROM opentelemetry_traces
WHERE "span_attributes.gen_ai.system" IS NOT NULL
GROUP BY "span_attributes.gen_ai.request.model", time_window;
```
```sql
-- Latency distribution: uddsketch stores a quantile sketch for p50/p95/p99
CREATE FLOW genai_latency_flow
SINK TO genai_latency_1m
EXPIRE AFTER '24h'
AS
SELECT
  "span_attributes.gen_ai.request.model" AS model,
  COUNT("span_attributes.gen_ai.request.model") AS request_count,
  uddsketch_state(128, 0.01, duration_nano) AS duration_sketch,
  date_bin('1 minute'::INTERVAL, "timestamp") AS time_window
FROM opentelemetry_traces
WHERE "span_attributes.gen_ai.system" IS NOT NULL
GROUP BY "span_attributes.gen_ai.request.model", time_window;
```

The demo defines a third Flow for status aggregation (request count by model and status code); see the full flows.sql for all three.
`uddsketch_state(buckets, error_rate, value)` is the quantile aggregation function introduced in GreptimeDB 1.0 RC1. Parameters: bucket count (128), error rate (1%), and the value column. You can query p50/p95/p99 directly from the rollup table, with no full trace scan needed:
```sql
SELECT
  model,
  ROUND(uddsketch_calc(0.50, duration_sketch) / 1000000, 1) AS p50_ms,
  ROUND(uddsketch_calc(0.95, duration_sketch) / 1000000, 1) AS p95_ms,
  ROUND(uddsketch_calc(0.99, duration_sketch) / 1000000, 1) AS p99_ms,
  time_window
FROM genai_latency_1m
ORDER BY time_window DESC
LIMIT 20;
```

3. Full-Text Search on Conversations, Linked to Traces
When log capture is enabled, every user message and model response lands in genai_conversations. GreptimeDB creates the table automatically on first ingest and puts a full-text index on the body column — no manual setup.
Search by keyword, get a list of matching conversations, click a trace_id to jump straight to the waterfall. matches_term() handles the search:
```sql
SELECT
  timestamp,
  trace_id,
  CASE WHEN json_get_string(parse_json(body), 'message.role') IS NOT NULL
       THEN json_get_string(parse_json(body), 'message.role')
       ELSE 'user' END AS role,
  COALESCE(
    json_get_string(parse_json(body), 'message.content'),
    json_get_string(parse_json(body), 'content')
  ) AS content
FROM genai_conversations
WHERE matches_term(body, 'GreptimeDB')
ORDER BY timestamp DESC
LIMIT 20;
```

The `body` structure differs by role: user messages are `{"content": "..."}` at the top level, while assistant responses nest it as `{"message": {"role": "assistant", "content": "..."}}`. The COALESCE handles both.
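If you ever post-process these rows outside SQL, the same normalization is easy to replicate. A minimal stdlib sketch, assuming the two body shapes described above:

```python
import json

def normalize_body(body: str) -> tuple[str, str]:
    """Return (role, content) for either conversation body shape:
    user messages:      {"content": "..."}
    assistant messages: {"message": {"role": "assistant", "content": "..."}}
    """
    doc = json.loads(body)
    msg = doc.get("message")
    if isinstance(msg, dict):
        return msg.get("role", "assistant"), msg.get("content", "")
    return "user", doc.get("content", "")

print(normalize_body('{"content": "What is GreptimeDB?"}'))
# ('user', 'What is GreptimeDB?')
print(normalize_body('{"message": {"role": "assistant", "content": "A database."}}'))
# ('assistant', 'A database.')
```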
The dashboard's "Search Conversations" panel is this query with a text input in front.


For tool calling, RAG pipelines, and multi-agent systems, traces produce nested span trees. Here's the tool-calling scenario from the demo:
```
tool_call_pipeline (4.01s)
├── plan_tool_use (2.62s)
│   └── chat llama3.2 (2.62s)      ← model decides it needs a tool call
├── execute_tools (59.31ms)
│   └── tool.calculate (59.23ms)   ← simulated tool execution
└── synthesize_answer (1.33s)
    └── chat llama3.2 (1.33s)      ← model called again with tool results
```
What You Can Pull Out of the Data
Cost estimation — based on actual token usage per model (rates below are gpt-4o-mini as of this writing; check the OpenAI pricing page for current numbers):
```sql
SELECT
  "span_attributes.gen_ai.request.model" AS model,
  SUM("span_attributes.gen_ai.usage.input_tokens") AS input_tokens,
  SUM("span_attributes.gen_ai.usage.output_tokens") AS output_tokens,
  ROUND(
    SUM("span_attributes.gen_ai.usage.input_tokens") * 0.15 / 1000000
    + SUM("span_attributes.gen_ai.usage.output_tokens") * 0.60 / 1000000,
    4
  ) AS estimated_cost_usd
FROM opentelemetry_traces
WHERE "span_attributes.gen_ai.system" IS NOT NULL
  AND timestamp > NOW() - '1 hour'::INTERVAL
GROUP BY model
ORDER BY estimated_cost_usd DESC;
```

Error rate by model:
```sql
SELECT
  "span_attributes.gen_ai.request.model" AS model,
  COUNT(*) AS total,
  COUNT(CASE WHEN span_status_code = 'STATUS_CODE_ERROR' THEN 1 END) AS errors,
  ROUND(
    COUNT(CASE WHEN span_status_code = 'STATUS_CODE_ERROR' THEN 1 END) * 100.0
    / COUNT(*),
    1
  ) AS error_rate_pct
FROM opentelemetry_traces
WHERE "span_attributes.gen_ai.system" IS NOT NULL
  AND timestamp > NOW() - '1 hour'::INTERVAL
GROUP BY model
ORDER BY error_rate_pct DESC;
```

PromQL — the OTel SDK histograms are PromQL-compatible out of the box:
```
# p95 token consumption
histogram_quantile(0.95,
  sum(rate(gen_ai_client_token_usage_bucket[5m])) by (le, gen_ai_token_type)
)

# Request rate by model
sum(rate(gen_ai_client_operation_duration_seconds_count[5m])) by (gen_ai_request_model)
```

SQL and PromQL both work against the same data, so you can mix panel types in a single Grafana dashboard without any extra setup.
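The cost estimate in the SQL above is plain arithmetic, so it's easy to sanity-check outside the database. A stdlib sketch using the same illustrative gpt-4o-mini rates (per 1M tokens; check current pricing before relying on these numbers):

```python
def estimated_cost_usd(input_tokens: int, output_tokens: int,
                       input_rate: float = 0.15, output_rate: float = 0.60) -> float:
    """Estimate cost in USD; rates are per 1M tokens (illustrative, not authoritative)."""
    return round(input_tokens * input_rate / 1_000_000
                 + output_tokens * output_rate / 1_000_000, 4)

print(estimated_cost_usd(1_000_000, 500_000))  # 0.45
```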

Getting Started
The code is at GreptimeTeam/demo-scene:
```shell
git clone https://github.com/GreptimeTeam/demo-scene.git
cd demo-scene/genai-observability
export OPENAI_API_KEY="sk-..."
docker compose --profile load up -d

# Grafana at http://localhost:3000 (admin / admin)
# → "GenAI Observability" dashboard
```

Ollama works too if you'd rather not use the OpenAI API:
```shell
docker compose --profile local up -d
docker compose --profile local exec ollama ollama pull llama3.2

OPENAI_BASE_URL=http://ollama:11434/v1 MODEL_NAME=llama3.2 \
  docker compose --profile load up -d
```

Wrapping Up
The OTel GenAI spec handles the "how do I emit consistent telemetry" problem. What it doesn't solve is where that data goes or how you query it afterward. GreptimeDB's angle here is straightforward: keep all three signal types in one place, use Flow to derive metrics from spans instead of writing them twice, and let SQL handle the cross-signal correlation that a three-system stack makes awkward. PromQL still works, so existing dashboards and alerts don't need to change.
If you try it out, we'd be curious what your own p95 latency and token distribution look like — especially if you're running multiple models side by side.


