It's 3:14am. PagerDuty fires. A Prometheus alert — ingestion lag has crossed the threshold. The L1 muscle memory kicks in: open the runbook and find the alert by name; read the verification steps listed for it; open Grafana to check the ingestion rate graphs; switch to BigQuery and run the SQL to verify whether records are actually coming through; cross-reference what you see; decide whether this warrants labeling it a real incident or closing it as noise; then either resolve it in PagerDuty or assign it to the right L2 engineer. Twenty minutes in, you still haven't done anything that fixed the problem — you've been working through a checklist.
If you've been on call, you've been here. The first fifteen minutes of every page are roughly identical, and they're roughly identical to the first fifteen minutes of every page in your career. The cognitive overhead at 3am is what makes oncall hard, not the actual decisions.
My alerting stack is Prometheus feeding PagerDuty, with Grafana for pipeline metrics and BigQuery as the data source I query to confirm ingestion health. When a page lands, my job as L1 is to follow the runbook, run its verification steps — which often means checking Grafana dashboards and querying BigQuery — label the alert in PagerDuty, and route it. That whole loop is what I automated with InsightWorker. It's not a replacement for me. It does the repetitive verification work so I show up to the actual judgement call already informed.
Alert to Resolution: How the Agent Works
- Prometheus: alert rule threshold crossed
- PagerDuty: page fires, on-call notified
- InsightWorker Agent: triggered via webhook, acks the page
- Runbook Matched: verification steps loaded for this alert
- Grafana + BigQuery: ingestion rate panels pulled, SQL verification query run
- Alert Labeled in PagerDuty: real incident confirmed or noise flagged
- Resolved (agent closes in PagerDuty) or Assign to L2 (structured handoff with full context)
What the L1 Virtual Support Agent does
The agent runs as a daemon and listens for PagerDuty webhooks. When a page fires, it executes a fixed sequence inside the incident's Slack thread:
- Acknowledge the page in PagerDuty so the on-call rotation isn't paged twice.
- Find the runbook matching the alert name in our docs / Confluence / GitHub wiki.
- Pull Grafana panels for the affected service over the last hour, drop the snapshots in the incident thread.
- Grep recent logs for the service, classify the top error types, attach the top three.
- Diff the last deploy against current production — is this incident likely deploy-induced or independent?
- Suggest the runbook's first step (restart, drain, failover) and wait for me to approve before executing.
- Escalate to L2 with a structured handoff if the runbook doesn't apply or the first step didn't help.
Seven things, all of which I used to do by hand at 3am, all of which the agent does in under ninety seconds.
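As a rough sketch of that fixed sequence, the read-only steps can be modeled as an ordered list of functions that each post one note to the incident thread. Everything named here (`Incident`, `run_triage`, the stub steps) is hypothetical and is not InsightWorker's API. Recording a failed step and moving on is an assumption about how a broken integration should be handled, not documented tool behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    alert_name: str
    service: str
    # Stand-in for the incident's Slack thread: one string per posted note.
    thread: List[str] = field(default_factory=list)


# A step reads the incident and returns a one-line note (hypothetical shape).
Step = Callable[[Incident], str]


def run_triage(incident: Incident, steps: List[Step]) -> List[str]:
    """Run read-only steps in order and post each result to the thread."""
    for step in steps:
        try:
            note = step(incident)
        except Exception as exc:  # assumption: one failed step should not block the rest
            note = f"{step.__name__} failed: {exc}"
        incident.thread.append(note)
    return incident.thread


# Stub steps for illustration; real ones would call PagerDuty, the wiki, Grafana.
def acknowledge_page(incident: Incident) -> str:
    return f"acknowledged {incident.alert_name}"


def find_runbook(incident: Incident) -> str:
    return f"runbook lookup for {incident.alert_name}"


if __name__ == "__main__":
    inc = Incident(alert_name="ingestion_lag_minutes", service="ingestion")
    for line in run_triage(inc, [acknowledge_page, find_runbook]):
        print(line)
```

Keeping the steps as plain functions in a list makes the order explicit and lets a new check be added without touching the loop.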
An actual incident from last week — the timeline
Tuesday, 3:14am. Real page. Alert: ingestion_lag_minutes > 30 for 10m — one of our data ingestion pipelines had gone quiet. Here's the timeline as it actually played out, with the agent driving:
03:14:08 page fires (PagerDuty ← Prometheus)
03:14:18 agent: PagerDuty ack'd
03:14:22 agent: matched runbook → "ingestion_lag_minutes" v5 (last edited 6 days ago)
03:14:31 agent: runbook verification steps for this alert:
step 1 — check Grafana ingestion rate panel (last 60m)
step 2 — query BigQuery: row counts in staging vs expected
step 3 — check upstream source connector status
03:14:48 agent: Grafana — ingestion rate panel (last 60m) captured
→ ingestion dropped to 0 at 03:04, still flat
03:15:10 agent: BigQuery verification query run:
SELECT TIMESTAMP_TRUNC(event_time, HOUR) AS hour, COUNT(*) AS row_count FROM dataset.events
WHERE event_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 HOUR)
GROUP BY 1
→ last hour: 0 rows. previous hour: 142,800 rows. ingestion stopped.
03:15:22 agent: upstream connector status — source API returning 401 (token expired)
03:15:28 agent: runbook step 3 match confirmed — credential rotation needed
→ permission needed to rotate source API token
03:15:50 me: approve
03:15:51 agent: token rotated, connector restarted
03:17:30 agent: Grafana — ingestion rate recovering (2,400 rows/min and climbing)
03:18:55 agent: BigQuery row count back to expected rate
03:21:00 agent: 5 minutes stable. labeling alert as resolved in PagerDuty.
03:21:05 page resolved.
Seven minutes from page to resolution. The same alert six months ago — before InsightWorker — took me 34 minutes: finding the right Grafana dashboard, remembering the BigQuery query to run, cross-referencing the output against the runbook, and then manually resolving in PagerDuty. Most of that time was mechanical. None of it required judgment.
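The "last hour vs. previous hour" comparison from the BigQuery step can be sketched as a small pure function. This is an illustration only: the function name and the `min_ratio` threshold are hypothetical, and the caller is assumed to have already turned the query result into a list of hourly counts. A `GROUP BY` returns no row at all for an hour with zero events, so empty hours have to be filled in as 0 before the comparison.

```python
from typing import List


def ingestion_verdict(hourly_counts: List[int], min_ratio: float = 0.1) -> str:
    """Compare the latest hour with the one before it.

    hourly_counts is ordered oldest to newest, with empty hours filled in as 0.
    Returns "stalled", "healthy", or "unknown".
    """
    if len(hourly_counts) < 2:
        return "unknown"  # not enough history to compare
    previous, latest = hourly_counts[-2], hourly_counts[-1]
    if previous == 0:
        return "unknown"  # no baseline hour to compare against
    # min_ratio is a hypothetical threshold: below 10% of the previous hour counts as stalled.
    return "stalled" if latest < previous * min_ratio else "healthy"


if __name__ == "__main__":
    # Counts from the incident timeline: previous hour 142,800 rows, last hour 0.
    print(ingestion_verdict([142_800, 0]))
```

Returning "unknown" instead of guessing keeps the agent from labeling an alert when it has no baseline to compare against.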
Where human approval stays in place
Every command that changes production state is permission-gated. Restart a service? I approve. Drain a node? I approve. Failover? I approve. The agent stages the action, shows me the exact command + change set, and waits. I tap a button. It executes.
Read-only operations — Grafana snapshots, log greps, deploy diffs, status-page lookups, Slack messages — auto-execute. That's the right line. The agent can investigate freely; it can't change state without me.
I also keep a hard list of things the agent never does, ever, even with my approval, because the blast radius is too high or the failure mode is too costly:
- Database schema changes
- DNS / TLS certificate operations
- Anything that touches secrets manager or KMS
- Customer comms (status page updates, support emails) — drafts only, I send
- L2/L3 escalation — the agent can prepare the handoff but the page-out goes through me
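The three tiers (auto-execute, approval-gated, never) can be sketched as a small policy function. This is a minimal sketch, not InsightWorker's implementation: the forbidden names mirror the `forbidden_actions` in my app config, while the read-only action names, `decide`, and `execute` are hypothetical. Treating any unknown action as approval-gated is an assumption chosen to fail safe.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"
    FORBIDDEN = "forbidden"


FORBIDDEN_ACTIONS = frozenset({"schema_migration", "dns_change", "secret_rotation"})
# Hypothetical names for the read-only operations.
READ_ONLY_ACTIONS = frozenset({"grafana_snapshot", "log_grep", "deploy_diff"})


def decide(action: str) -> Decision:
    """Classify an action; anything not explicitly read-only needs approval."""
    if action in FORBIDDEN_ACTIONS:
        return Decision.FORBIDDEN
    if action in READ_ONLY_ACTIONS:
        return Decision.AUTO
    return Decision.NEEDS_APPROVAL


def execute(action: str, run: Callable[[], str], approved: bool = False) -> str:
    """Run the action only if the policy allows it."""
    decision = decide(action)
    if decision is Decision.FORBIDDEN:
        # Approval does not unlock forbidden actions.
        raise PermissionError(f"{action} is never run by the agent")
    if decision is Decision.NEEDS_APPROVAL and not approved:
        return f"staged {action}: waiting for approval"
    return run()


if __name__ == "__main__":
    print(execute("log_grep", lambda: "top 3 errors attached"))
    print(execute("connector_restart", lambda: "connector restarted"))
    print(execute("connector_restart", lambda: "connector restarted", approved=True))
```

Checking the forbidden set first means an approval flag can never reach a forbidden action, which is the property the hard list is meant to guarantee.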
When the agent says 'I don't know'
The most important behavior I had to teach the agent is when to give up. If the runbook doesn't match, or if the first runbook step didn't fix the problem, or if the metrics don't match the runbook's preconditions, the agent stops and escalates with a structured handoff to L2:
Incident: #inc-2287 · checkout-latency-p99 · payment-gateway
Runbook attempted: "checkout-latency-p99" v3, step 1 (failover)
Result: failover enabled, error rate did NOT drop in 5m
Additional context I gathered:
- 47x upstream timeout to stripe-api (94% of errors)
- stripe.com status page: degraded
- last deploy: 14h ago, unrelated services
L1 escalating because runbook step 1 did not resolve.
Suggested L2 starting points:
- Verify retry queue is actually receiving traffic (Grafana panel)
- Check upstream dependency (stripe-api) for sustained outage
- Consider customer-facing comms
The handoff is the bit that mattered most for the team's L2 engineers. Before, my escalations were a Slack message that read like "hey, payment-gateway latency, stripe might be involved, can you take a look?" Now they're a runbook trace + the metrics + the dependency state + the L1 attempt — all in the thread before the L2 engineer has finished waking up. The L2 engineers told me the difference is night-and-day.
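The give-up rule and the handoff format can be sketched together. The `Handoff` class, its field names, and `escalation_reason` are hypothetical and only mirror the layout of the example above; they are not the agent's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


def escalation_reason(runbook_matched: bool,
                      preconditions_met: bool,
                      first_step_resolved: bool) -> Optional[str]:
    """Return why L1 should hand off, or None if the runbook path worked."""
    if not runbook_matched:
        return "no runbook matched the alert"
    if not preconditions_met:
        return "metrics do not match the runbook's preconditions"
    if not first_step_resolved:
        return "runbook step 1 did not resolve"
    return None


@dataclass
class Handoff:
    incident: str
    runbook: str
    result: str
    context: List[str]
    reason: str
    suggestions: List[str]

    def render(self) -> str:
        """Format the handoff as the plain-text block posted to the thread."""
        lines = [
            f"Incident: {self.incident}",
            f"Runbook attempted: {self.runbook}",
            f"Result: {self.result}",
            "Additional context I gathered:",
        ]
        lines += [f"- {item}" for item in self.context]
        lines.append(f"L1 escalating because {self.reason}.")
        lines.append("Suggested L2 starting points:")
        lines += [f"- {item}" for item in self.suggestions]
        return "\n".join(lines)


if __name__ == "__main__":
    reason = escalation_reason(True, True, False)
    if reason is not None:
        print(Handoff(
            incident="#inc-2287",
            runbook='"checkout-latency-p99" v3, step 1 (failover)',
            result="failover enabled, error rate did NOT drop in 5m",
            context=["last deploy: 14h ago, unrelated services"],
            reason=reason,
            suggestions=["Check upstream dependency (stripe-api) for sustained outage"],
        ).render())
```

Making the reason a required field forces every escalation to say why L1 stopped, which is the part my old Slack messages left out.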
Setup — what's actually in the app
Three pieces wire it together:
First, a webhook from PagerDuty that fires the agent. A small forwarder we run in our cluster receives the events and writes them to a queue file; the scheduler daemon (insightworker --daemon) runs as a long-lived process, watches that file, and starts the app:
schedules:
  - name: l1-incident-triage
    trigger: webhook
    webhook_path: /pagerduty
    app: l1-incident-triage
Second, an app config that defines the integrations and the safety guardrails:
# .insightworker/apps/l1-incident-triage/config.yaml
app: l1-incident-triage
integrations:
  pagerduty:
    api_key_env: PAGERDUTY_API_KEY
  prometheus:
    base_url: https://prometheus.internal
    api_key_env: PROMETHEUS_API_KEY
  grafana:
    base_url: https://grafana.internal
    api_key_env: GRAFANA_API_KEY
  bigquery:
    project_id: my-data-project
    credentials_env: BIGQUERY_SA_KEY
  slack:
    bot_token_env: SLACK_BOT_TOKEN
    incident_channel_prefix: "#inc-"
runbook_source:
  type: confluence
  space: SRE
  base_url: https://acme.atlassian.net
safety:
  permission_gated_actions:
    - bash
    - bigquery_write
    - connector_restart
  forbidden_actions:
    - schema_migration
    - dns_change
    - secret_rotation
Third, an app prompt that orchestrates the steps using InsightWorker's tools — web_fetch for PagerDuty and Grafana panels, bigquery_query for running the SQL verification steps from the runbook, bash for connector restarts and credential rotation. Nothing custom. The agent loop chains them automatically based on the runbook's verification steps for each alert.
What broke during the first month
Three things bit me, all worth flagging:
- Runbook drift. Our runbooks were stale — half referenced services that had been renamed two quarters ago. The agent followed them faithfully and broke. We did a runbook audit week (the agent helped — it scanned every runbook, flagged service-name mismatches against current Kubernetes deployments, and we fixed the top 30 in two afternoons).
- Alert noise. The agent ran on every page, including the noisy ones ("disk usage at 71% on a node we don't care about"). Within a week the team was tired of agent posts in incident channels. We added a noise filter — alerts under priority P3 don't trigger the L1 agent, they go to a daily digest instead. That digest is its own scheduled app now.
- Agent-in-the-loop fatigue. The first version asked me to approve every read-only action too. I started ignoring approvals because they were all benign. Bad pattern — the day a real destructive action came up I might have approved it on autopilot. We split the permissions cleanly: read-only operations auto-approve; anything that changes state requires explicit approval, and the approval message names the change set.
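The runbook drift check from the first item reduces to a set difference once the service names are extracted. A minimal sketch, assuming the extraction from runbook text and the list of current Kubernetes deployments are produced elsewhere; `find_stale_runbooks` and the sample names are hypothetical.

```python
from typing import Dict, List, Set


def find_stale_runbooks(runbook_services: Dict[str, Set[str]],
                        deployed: Set[str]) -> Dict[str, List[str]]:
    """Map each runbook title to the services it references that are not deployed."""
    stale: Dict[str, List[str]] = {}
    for title, services in runbook_services.items():
        missing = sorted(services - deployed)
        if missing:
            stale[title] = missing
    return stale


if __name__ == "__main__":
    # Hypothetical data: one runbook still points at a renamed service.
    runbooks = {
        "ingestion_lag_minutes": {"ingest-worker", "source-connector"},
        "checkout-latency-p99": {"payment-gateway", "checkout-api-old"},
    }
    deployed = {"ingest-worker", "source-connector", "payment-gateway", "checkout-api"}
    for title, missing in find_stale_runbooks(runbooks, deployed).items():
        print(f"{title}: references {', '.join(missing)}")
```

Running a check like this on a schedule would catch a rename before the agent follows a stale runbook during a page.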
What I still own
Human-only: these don't move to the agent
1. Severity calls. Is this customer-impacting? Should we comm externally? Should we wake an exec? The agent surfaces signals; I make the call.
2. Escalation timing. The agent escalates to L2 when the runbook doesn't apply or step 1 didn't help. But "this is bigger than runbook scope" is a judgment call that's mine.
3. Post-incident framing. The agent drafts a clean post-incident summary from the timeline, but the narrative (what failed, why we didn't catch it sooner, what we'll change) is my writeup, with my team's review.
If you're worried about an agent making bad on-call decisions, set the gates the same way I did. Read freely. Change nothing without approval. Escalate cleanly. Never own customer comms or exec wake-ups. The agent does what it can demonstrably do well; the human keeps what only a human should do.
Other DevOps / SRE apps where the same pattern fits
After the L1 triage agent stabilized, I bolted on four more apps that share the same tools and the same daemon:
- Daily alert hygiene digest. Every weekday morning, summarize the past 24h of pages: which alerts fired, which were resolved by L1 vs L2, which were noise, which need runbook updates. Drives our weekly oncall review meeting.
- Deploy verification. After every production deploy, the agent runs a 5-minute checklist (error rate, latency, health endpoints, db connections) and posts a thumbs-up or a structured rollback recommendation.
- SLO breach root-cause sketch. When an SLO burn-rate alert fires, the agent runs the same triage shape but optimized for the longer-window analysis instead of an acute spike.
- Cert + secret expiry forecast. Weekly scan of certificates and secret rotations due in the next 30 days, posts to the SRE channel with renewal owners.
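The expiry forecast in the last item is a date-window filter at its core. A minimal sketch, assuming the expiry dates have already been collected into a dictionary; `expiring_soon` and the sample names are hypothetical, and only the 30-day window comes from the description above.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def expiring_soon(expiries: Dict[str, date],
                  today: date,
                  window_days: int = 30) -> List[Tuple[str, int]]:
    """Return (name, days_left) for items expiring within the window, soonest first.

    Already-expired items are included with a negative days_left.
    """
    cutoff = today + timedelta(days=window_days)
    due = [
        (name, (expires - today).days)
        for name, expires in expiries.items()
        if expires <= cutoff
    ]
    return sorted(due, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical inventory of certificates and secrets.
    inventory = {
        "api-gateway-tls": date(2024, 1, 15),
        "source-api-token": date(2024, 3, 1),
    }
    for name, days_left in expiring_soon(inventory, today=date(2024, 1, 1)):
        print(f"{name}: {days_left} days left")
```

Passing `today` in as a parameter, instead of calling `date.today()` inside, keeps the function deterministic and easy to check against known dates.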
Other SRE areas this approach plausibly handles
Haven't tried these yet, but the fit is obvious:
- Synthetic monitoring failure triage — same pattern as L1 incident, narrower input shape
- Capacity + cost anomaly digest — daily diff of resource usage vs. forecast, flag the outliers
- Database slow-query triage — pull pg_stat_statements / Performance Insights, classify by query family, suggest the index or rewrite
- Kubernetes pod crashloop investigator — auto-pulls events, logs, recent deploys for any pod in CrashLoopBackOff
- On-call shift handoff brief — at the end of each shift, the agent assembles a summary of open issues + context for the incoming oncall
- Chaos test reviewer — after every game-day or chaos run, the agent assembles the failure modes, the system response, and the gaps
- Postmortem first draft — given the incident timeline + Slack thread + runbook trace, draft the postmortem doc the team edits
What I'd tell another SRE team starting today
Pick the alert that pages your L1 most often. Write down the first fifteen minutes of work you do every time it fires. That's your prompt. Run it for a week with all destructive actions disabled (read-only mode). Watch where it gets stuck or wrong. Fix the prompt. Then turn on permission-gated execution for the runbook's first step. Then for the second. Don't open the gates faster than your trust grows.
After six months on this, the part of oncall I dread is gone. Not because there are fewer pages — there are about the same — but because the cognitive lift of the first fifteen minutes is mostly the agent's. I show up at minute three, not minute twenty, with the context already in front of me. I make the call. The agent does the rest.
The use-case page walks through the app with screenshots of an actual incident triage and the L2 handoff format if you want to see what each step looks like in context.