troubleshooting warning general ·

Troubleshoot Database Latency with Grafana Assistant

Practical guide to troubleshooting database latency with Grafana. Includes production-ready steps, dashboard setup, metric correlation, and proactive alerting for...

Troubleshoot Database Latency with Grafana Assistant
Advertisement

Introduction

You deployed a new feature, but your dashboard just turned red. Database latency has spiked from 20ms to 3 seconds. Users are complaining, and you don’t know if it’s a slow query, connection pool exhaustion, or a blocked transaction.

Grafana can answer that question. This guide walks through the full troubleshooting workflow: starting with the right metrics, building a targeted dashboard, and using Grafana’s correlation features to trace a single slow transaction from your application code all the way to the database execution plan. By the end, you will know exactly how to move from “there’s a latency spike” to “this specific query on table orders is waiting on a lock from a long-running transaction in service payment-worker.”

Most articles stop at showing you a dashboard. They do not show you how to connect Grafana, Loki, and Tempo to perform root cause analysis. This one does.

Why Grafana for Database Latency Troubleshooting?

Grafana is not a database monitoring tool. It is a unified observability platform that aggregates metrics, logs, and traces from multiple data sources. This distinction matters when you are troubleshooting latency.

A dedicated database monitoring tool shows you query latency over time. That is useful, but it does not answer the deeper question: what caused the latency? Grafana with Prometheus (metrics), Loki (logs), and Tempo (traces) provides the full picture.

Consider this scenario: p99 query latency spikes from 50ms to 2000ms at 14:32. With only database metrics, you see the spike but not the cause. With Grafana, you can:

  • Pull the slow query log from Loki at exactly 14:32.
  • Open the Tempo trace for a single request that executed during the spike.
  • See that the trace includes a 1800ms database call, plus a 100ms gRPC call to an external payment service.
  • Realize the payment service timed out, leaving an open transaction that blocked subsequent queries.

That is the concrete advantage. Grafana turns a latency spike from a symptom into a solvable problem. For more on building this kind of pipeline, the tutorial on how to set up LLM observability with OpenTelemetry covers the same tracing approach applied to AI workloads.

Required Data Sources

Before you start, confirm you have these sources configured:

  • Prometheus or VictoriaMetrics for database and infrastructure metrics.
  • Loki for slow query logs and application error logs.
  • Tempo or Jaeger for distributed traces.

If you are using PostgreSQL, install the postgres_exporter for Prometheus. For MySQL, use mysqld_exporter. Both expose query execution time, connection counts, and database locks as Prometheus metrics.

Essential Metrics for Database Latency

You do not need every metric that exists. Focus on the ones that directly indicate the four common latency causes: slow queries, connection pool exhaustion, database load, and lock contention.

Query Execution Time (p50/p95/p99)

Latency distributions tell you more than averages. A 50ms average can hide a sub-second spike. Track p50 (typical user experience), p95 (worst-case for most users), and p99 (the outliers that break SLAs).

# PostgreSQL query latency histogram (from postgres_exporter)
histogram_quantile(0.99,
  rate(pg_stat_activity_max_tx_duration_seconds_bucket[1m])
)

Connection Pool Usage

Connection pool exhaustion mimics a slow database. When all connections are active, new requests queue. You need to see active, idle, and waiting connections in one panel.

# PostgreSQL connections by state
avg by (state) (
  pg_stat_activity_count{datname="your_database"}
)

Query Throughput

A sudden spike in queries per second (QPS) can overwhelm even a tuned database. Combine QPS with latency to distinguish between “the database is slow” and “the database is being hammered.”

# Queries per second
rate(pg_stat_database_xact_commit{datname!~"template.*"}[1m])

Database Load (CPU, I/O, Locks)

Database-level resource usage explains why queries are slow. High CPU suggests expensive operations (sequential scans, sorting). High I/O wait suggests missing indexes. High lock wait suggests transaction contention.

# Lock wait time (PostgreSQL)
rate(pg_stat_database_conflicts{datname!~"template.*"}[1m])

Building a Database Performance Dashboard in Grafana

Create a new dashboard and add these panels. Use dashboard variables for $database and $query_type to make it reusable without editing queries.

Variable Setup

  1. Go to Dashboard Settings > Variables.
  2. Add a variable named database of type Query, using your Prometheus data source.
  3. Query:
label_values(pg_stat_database_numbackends, datname)
  1. Add a second variable named query_type using a custom list: SELECT, INSERT, UPDATE, DELETE.

Panel 1: Query Latency Heatmap (p50/p95/p99)

A time series panel showing p50, p95, and p99 as separate series. Use a shared Y-axis (log scale works well for latency).

# p50
histogram_quantile(0.50,
  sum(rate(pg_stat_activity_max_tx_duration_seconds_bucket{datname="$database"}[5m]))
)

# p95
histogram_quantile(0.95,
  sum(rate(pg_stat_activity_max_tx_duration_seconds_bucket{datname="$database"}[5m]))
)

# p99
histogram_quantile(0.99,
  sum(rate(pg_stat_activity_max_tx_duration_seconds_bucket{datname="$database"}[5m]))
)

Set the Y-axis unit to seconds and the Min to 0.001 to hide stale sub-millisecond noise.

Panel 2: Connection Pool Activity

A stacked bar chart or time series showing active, idle, and waiting connections. Add a threshold line for your max_connections value (for example, 100).

# Active connections
avg by (state) (
  pg_stat_activity_count{datname="$database", state="active"}
)

# Idle connections
avg by (state) (
  pg_stat_activity_count{datname="$database", state="idle"}
)

# Waiting connections
avg by (state) (
  pg_stat_activity_count{datname="$database", state="waiting"}
)

Panel 3: Slow Query Log Dashboard (Loki)

This panel pulls slow query logs from Loki. Configure Loki as a data source first. Use the LogQL query below, filtered by the database variable.

{app="postgresql"} |= "duration:" |~ "duration: [5-9]\\d{2}ms|[0-9]+\\.[0-9]+s"

Set the visualization to Logs and enable the “Show labels” option to see the database name, user, and query string. This panel becomes your starting point for investigation.

Panel 4: Infrastructure Context (CPU, Memory, Network)

Overlay database metrics with host-level metrics. Use annotations to mark deployments or configuration changes.

# Node CPU usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"db-.*"}[5m])) * 100)

Add a panel for network latency between the application and database:

# Network latency between app and db (requires blackbox_exporter)
probe_duration_seconds{target="your-database:5432"}

Using Grafana Explore for Deep-Dive Analysis

When a panel shows a latency spike, open Grafana Explore to investigate. Explore gives you an unbounded query interface. You are not limited by dashboard panels.

Step 1: Pinpoint the Time Window

Identify the exact start and end of the latency spike from your dashboard. Type these timestamps into the Explore time picker. This narrows the search dramatically.

Step 2: Pull Corresponding Traces from Tempo

Run a trace query using your application’s trace ID. If you do not have the trace ID, query by service name and duration.

Open the Tempo data source in Explore and run:

{ .service.name = "payment-worker" } && duration >= 2s

This returns all traces from payment-worker that took more than 2 seconds. Sort by duration descending. Open the longest trace.

Step 3: Correlate the Trace with Logs

Inside the trace view, click the database span. Tempo shows the span duration (for example, 1800ms) and any attributes (query string, database name, error code). Click “Show Logs” to query Loki for logs from that exact trace.

Grafana automatically passes the trace ID to Loki. You see the application log that corresponds to the span:

{service_name="payment-worker"} |= "traceID=abc123"

Now you have the full chain: the trace shows a slow database call, the logs show the exact query that executed, and the database metrics show the lock wait time.

Troubleshooting Connection Pool Latency

Connection pool issues do not always show as high latency in the database itself. They show as connection wait time in your application metrics. If you do not track application-side connection pool metrics, you miss this entirely.

Spotting Connection Starvation

Look at your connection pool panel. If waiting connections are consistently above zero while active is at max_connections, your application is starving for connections.

The cause is almost always one of these:

  • max_connections set too low for peak traffic.
  • Pool size mismatch: your application connection pool (for example, HikariCP with maximumPoolSize=50) exceeds the database max_connections=100 when multiplied across application replicas. Three replicas at 50 each = 150, which starts failing.
  • Long-running queries holding connections open. A missing index causing a 5-second sequential scan on a frequently-read table will hold a connection for the entire duration.

Intermittent Connection Refused Errors

When max_connections is breached, the database refuses new connections. Check your application logs in Loki for connection refused strings:

{service_name="your-app"} |= "connection refused"

If this pattern appears during latency spikes, you have a tight pool. Increase max_connections gradually (test with 20% increments) and verify your application pool total across all replicas stays under the new limit.

Correlating Latency with Infrastructure and Application Events

A latency spike at 14:32 may have nothing to do with the database. It could be a noisy neighbor on the host, a DNS resolution delay, or a Kubernetes pod restart that left connections in a half-open state.

Adding Deployment Annotations

In your dashboard, go to Panel Settings > Annotations. Add a query annotation for your deployment events:

# Deployments (from your CI metrics or Kubernetes events)
timestamp(kube_deployment_created{namespace="production"})

Set the annotation title to $deployment and text to Deployed version $version. Now every deployment is marked on your latency graphs. If latency spikes align with a deployment, you know where to look first.

Monitoring Network Latency Separately

Network latency between the application and database is a common root cause in cloud environments. Use the blackbox_exporter to probe the database port and surface the metrics in Grafana.

Add a panel showing probe_duration_seconds for the database target. If you see a spike in network latency at the same time as your query latency spike, the database is not the problem; the network is.

Alerting and Proactive Detection

Waiting for a user to report slowness is not acceptable. Set up Grafana alert rules for the patterns that precede outages.

Alert Rule: p99 Latency Threshold

A simple but effective rule: alert when p99 query latency exceeds 500ms for 5 minutes.

  1. Go to Alerting > Alert Rules > New Alert Rule.
  2. Select your Prometheus data source.
  3. Query:
histogram_quantile(0.99,
  sum(rate(pg_stat_activity_max_tx_duration_seconds_bucket{datname!~"template.*"}[5m]))
)
  1. Set condition: WHEN last() > 0.5.
  2. Set evaluation interval: 5m.
  3. Set pending period: 5m (reduces flapping).

Alert Rule: Connection Pool Exhaustion

Alert before you hit max_connections. If your max_connections is 100, alert when total connections exceed 80.

sum(pg_stat_activity_count) > 80

Alert Routing to Slack or PagerDuty

Connect a notification policy:

  • Go to Alerting > Contact Points.
  • Add a contact point for Slack (webhook URL) or PagerDuty (integration key).
  • Create a notification policy that routes the “Critical” severity label to this contact point.

For escalation beyond the first response, Grafana OnCall can route alerts to different teams based on label matching.

Anomaly Detection vs. Static Thresholds

Static thresholds work for known patterns (for example, p99 over 500ms). But some latency spikes are gradual and stay under the threshold until suddenly they do not. Use Grafana’s machine learning features (or a Prometheus recording rule with standard deviation) to detect anomalous behavior.

A simple approach: calculate the rolling 7-day average latency and alert when current latency exceeds 3 standard deviations above the average.

# Rolling 7-day average of p99 latency
avg_over_time(
  histogram_quantile(0.99,
    rate(pg_stat_activity_max_tx_duration_seconds_bucket[5m])
  )[7d:5m]
)

Compare this baseline to live values. An alert triggers when the live value drifts significantly.

Prevention: Build the Dashboard Before You Need It

Do not wait for an outage to build this dashboard. Set it up now, during normal operation, so you have baseline metrics. The annotation layer showing normal deployment cadence becomes your historical record. When a latency spike happens, you can immediately check “did we deploy something that changed query patterns?”

The key insight: most database latency issues are caused by application changes, not infrastructure decay. A slow query today was fast yesterday because the application started sending different parameters or joined an extra table. Your Grafana correlation workflow (metrics -> logs -> traces) catches exactly that.

For teams running Kubernetes, understanding how pod restarts affect database connections is critical. The guide on Kubernetes troubleshooting: why did my pod die covers the infrastructure side of this equation. Similarly, if you manage connection pools with Terraform and your max_connections configuration is in IaC, the article on Terraform state locking explains how to safely update those values without competing apply operations.

FAQ

Q: What is the fastest way to find a slow query in Grafana?

Open Grafana Explore for your Loki data source and run a query for log entries containing duration: or slow query. Most databases log queries that exceed a configurable threshold (for example, PostgreSQL log_min_duration_statement = 2000). Filter by a 5-minute window around the latency spike. The logs show the exact query string and execution time.

Q: How do I know if the database or the network is causing latency?

Add a panel for probe_duration_seconds from the blackbox_exporter targeting your database port. If this metric spikes at the same time as your query latency spike, the network is the bottleneck. If query latency spikes but network latency stays flat, the database or application is the bottleneck. This single overlay eliminates the most common false lead.

Q: Is there a difference between latency and connection pool exhaustion? How do I tell?

Yes. Latency means queries take longer to execute. Connection pool exhaustion means queries never start because no connection is available. In Grafana, a latency spike shows in your query latency panel. Connection exhaustion shows in your connection pool panel (active connections at max, waiting connections climbing). You can also check application logs for “connection refused” or “pool exhausted” errors. These require different fixes: latency needs query optimization or indexing; exhaustion needs pool size tuning or replica scaling.

Advertisement

Stay up to date

Get DevOps tips, tutorials, and guides delivered to your inbox.