Kafka Incident Triage from Datadog Alerts

← All use cases

Data Streaming — Confluent Kafka

SRE, platform engineering, on-call rotations — any team running Kafka with Datadog

Confluent CloudDatadogincident responseSREon-call

The problem

On-call engineer wakes up to 'orders-service-consumer-lag > 50000' from Datadog. They open Datadog (which dashboard?), the Confluent Cloud console (which cluster?), the deploy log (which service?), and Slack (anyone else seeing this?). 30+ minutes before they've even formed a hypothesis.
Lag spikes are usually caused by a small number of patterns — recent rebalance, slow downstream, schema change, deploy. The senior engineer knows which to check first; the junior engineer learns by paging the senior.
Even when the cause is identified, writing the post-incident summary means re-reading dashboards and pasting screenshots. The audit trail is rebuilt every time.
Mean-time-to-diagnose grows linearly with team size as institutional knowledge dilutes.

How InsightWorker handles it

Trigger from a Datadog monitor webhook (gateway mode) or scheduled poll — 'every 2 minutes, check if any monitor tagged kafka:* is alerting'. datadog_list_monitors · scheduler · gateway webhook

For each alerting monitor, identify the Kafka resource — consumer group, topic, or connector — from the monitor's tags. agent reasoning

Pull the consumer group's current state, member assignments, and rebalance history (last hour) via Confluent Cloud REST API. confluent_consumer_groups

Pull recent Connect connector failures touching the same topics (last hour). confluent_connect

Pull recent schema changes for the topic's subject (last 24h) — a breaking schema change with a still-deployed consumer is the most common cause people miss. confluent_schema_registry

Query Datadog for the producer rate and downstream service latency — has the producer surged, or has the consumer slowed? datadog_query_metrics

Synthesize a triage summary via Bedrock — probable cause ranked by likelihood, suggested first three actions, escalation criteria. Posted to the incident Slack channel and saved as incident_<id>.md. agent reasoning · send_email or Slack adapter

Sample prompt

"Datadog just paged for orders-svc-consumer-lag — pull context and post the triage to #incidents."

Deliverables: incident_<id>.md · suggested_actions.md · escalation note (if criteria met) · audit trail per incident

Prefer the browser?

Run this in InsightStudio — no CLI install for the user.

Authors publish the app once with iw app publish; business users open it in the marketplace and click Run. Your worker box does the execution.

Visit InsightStudio →

Try this use case yourself

Free trial available — CLI, Desktop, VS Code, and the new --worker mode for InsightStudio. See download for details.

Download Free Trial