InsightWorker Logo

Kafka Incident Triage from Datadog Alerts

When Datadog pages your on-call at 3 AM for a consumer-lag spike, InsightWorker pulls the Kafka context, recent rebalances, related connector failures, and recent schema changes — and ships a triage summary before the engineer finishes their first coffee.

← All use cases
Data Streaming — Confluent Kafka
SRE, platform engineering, on-call rotations — any team running Kafka with Datadog
Confluent CloudDatadogincident responseSREon-call

The problem

  • On-call engineer wakes up to 'orders-service-consumer-lag > 50000' from Datadog. They open Datadog (which dashboard?), the Confluent Cloud console (which cluster?), the deploy log (which service?), and Slack (anyone else seeing this?). 30+ minutes before they've even formed a hypothesis.
  • Lag spikes are usually caused by a small number of patterns — recent rebalance, slow downstream, schema change, deploy. The senior engineer knows which to check first; the junior engineer learns by paging the senior.
  • Even when the cause is identified, writing the post-incident summary means re-reading dashboards and pasting screenshots. The audit trail is rebuilt every time.
  • Mean-time-to-diagnose grows linearly with team size as institutional knowledge dilutes.

How InsightWorker handles it

1
Trigger from a Datadog monitor webhook (gateway mode) or scheduled poll — 'every 2 minutes, check if any monitor tagged kafka:* is alerting'. datadog_list_monitors · scheduler · gateway webhook
2
For each alerting monitor, identify the Kafka resource — consumer group, topic, or connector — from the monitor's tags. agent reasoning
3
Pull the consumer group's current state, member assignments, and rebalance history (last hour) via Confluent Cloud REST API. confluent_consumer_groups
4
Pull recent Connect connector failures touching the same topics (last hour). confluent_connect
5
Pull recent schema changes for the topic's subject (last 24h) — a breaking schema change with a still-deployed consumer is the most common cause people miss. confluent_schema_registry
6
Query Datadog for the producer rate and downstream service latency — has the producer surged, or has the consumer slowed? datadog_query_metrics
7
Synthesize a triage summary via Bedrock — probable cause ranked by likelihood, suggested first three actions, escalation criteria. Posted to the incident Slack channel and saved as incident_<id>.md. agent reasoning · send_email or Slack adapter

Sample prompt

"Datadog just paged for orders-svc-consumer-lag — pull context and post the triage to #incidents."
Deliverables: incident_<id>.md · suggested_actions.md · escalation note (if criteria met) · audit trail per incident
Prefer the browser?
Run this in InsightStudio — no CLI install for the user.

Authors publish the app once with iw app publish; business users open it in the marketplace and click Run. Your worker box does the execution.

Visit InsightStudio →

Try this use case yourself

Free trial available — CLI, Desktop, VS Code, and the new --worker mode for InsightStudio. See download for details.

Download Free Trial