Automated Incident Response Workflows with n8n and Monitoring Tools


May 25, 2025

1. Incident Response Process

Effective incident management follows a structured approach that automation can enhance at every stage:

Preparation

  • Documentation: Maintain runbooks and response procedures
  • Monitoring setup: Configure Prometheus alerts with appropriate thresholds
  • Team readiness: Ensure on-call schedules and escalation paths are clear

Detection & Analysis

  • Alert triggering: Prometheus detects anomalies based on predefined rules
  • Initial assessment: Automated classification of severity and impact
  • Contextual enrichment: Adding relevant system information to alerts

Recovery

  • Notification routing: Directing alerts to appropriate channels (PagerDuty, Slack)
  • Guided response: Linking to relevant runbooks or documentation
  • Automated remediation: Triggering recovery scripts for known issues

Post-incident Activity

  • Documentation: Capturing timeline and response details
  • Analysis: Identifying root causes and prevention opportunities
  • Process improvement: Updating monitoring rules and response workflows

In the steps below, we will build an automation flow that covers Detection & Analysis and routes alerts based on their context.

2. Building an n8n Incident Response Workflow

Let's create a practical workflow that handles Prometheus alerts intelligently, ensuring critical issues are delivered while taking severity, business hours, and other context into account.

This workflow demonstrates a basic flow for handling incidents before your team is available to respond.

Workflow Overview

Our goal is to build a system that:

  • Receives alerts from Prometheus/AlertManager
  • Analyzes severity and business hours
  • Routes critical after-hours alerts to PagerDuty for immediate response
  • Sends less urgent or business-hours alerts to Slack/Discord
  • Automates incident resolution based on AI-agent suggestions
  • Triggers a Lambda function to execute your operations (you can customize it for ECS, EKS, ...)
  • Documents all incidents in a structured format (Notion)

This approach ensures your team isn't woken up unnecessarily while maintaining rapid response for genuine emergencies.

Setting Up the Monitoring Stack

Before building our n8n workflow, we need a properly configured monitoring system:

  1. Install the Prometheus stack

    • Prometheus server for metrics collection
    • Exporters for your specific services
    • AlertManager for alert handling
    • Grafana for visualization

    To configure the stack easily, I’ve made a click-and-run script to deploy the “Node Exporter - Prometheus - Alert Manager - Grafana Stack” here. You can take a look, then pull and deploy it yourself.

  2. Configure alert rules in Prometheus


    Below is a sample rule file that triggers the alerts:

    groups:
    - name: example
      rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle",job="exporters"}[5m])) * 100) > 10
        for: 1m
        labels:
          severity: critical
          service: node
        annotations:
          description: "CPU usage on {{ $labels.instance }} is {{ $value }}%"
      - alert: HighLatency
        expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) > 0.5
        for: 1m
        labels:
          severity: warning
          service: web
        annotations:
          description: "Response time on {{ $labels.instance }} is {{ $value }}s"
  3. Set up AlertManager to forward to n8n

    route:
      receiver: 'n8n-webhook'
      group_by: ['alertname', 'instance']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 30m
    
    receivers:
    - name: 'n8n-webhook'
      webhook_configs:
      - url: 'http://your-n8n-instance:5678/webhook/prometheus'
        send_resolved: true
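
Once AlertManager is active, it will POST a JSON payload to the n8n webhook. Below is a trimmed example of that payload, based on Alertmanager's standard webhook format (exact fields can vary slightly between versions, and the instance value here is just a placeholder):

    {
      "receiver": "n8n-webhook",
      "status": "firing",
      "alerts": [
        {
          "status": "firing",
          "labels": {
            "alertname": "HighCPU",
            "instance": "your-host:9100",
            "severity": "critical",
            "service": "node"
          },
          "annotations": {
            "description": "CPU usage on your-host:9100 is 93.5%"
          },
          "startsAt": "2025-05-25T06:40:29.682Z",
          "endsAt": "0001-01-01T00:00:00Z",
          "fingerprint": "80e7d055dbb50b48"
        }
      ],
      "groupLabels": { "alertname": "HighCPU", "instance": "your-host:9100" },
      "commonLabels": { "alertname": "HighCPU", "severity": "critical" },
      "externalURL": "http://your-alertmanager:9093"
    }

The Code node in the next section reads this structure from items[0].json.body.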

Building the n8n Workflow

Now let's create our n8n workflow to process these alerts intelligently:

  1. Create a Webhook node
    • This will receive alerts from AlertManager
    • Configure it to parse JSON payloads

      Note: To use "Production URL", you have to set the workflow to "Active"

  2. Add a Code node for alert classification

    To analyze the payload sent by Prometheus, we will parse the JSON. We will also compute whether the alert falls within business hours and how long the incident has lasted, which helps us evaluate the incident better.

    Add the Code node and enter the following JavaScript:

    // Parse the AlertManager payload forwarded by the Webhook node
    const alerts = items[0].json.body.alerts || [];

    return alerts.map(alert => {
      const startsAt = new Date(alert.startsAt);
      // Note: for alerts that are still firing, endsAt may be unset (zero time)
      const endsAt = new Date(alert.endsAt);
      const hour = endsAt.getUTCHours();
      const isBusinessHours = hour >= 9 && hour < 17; // 9 AM–5 PM UTC
      const durationMinutes = (endsAt - startsAt) / 1000 / 60; // Duration in minutes

      return {
        json: {
          status: alert.status, // firing or resolved
          alertname: alert.labels.alertname, // e.g., HighCPU
          severity: alert.labels.severity, // e.g., critical
          instance: alert.labels.instance, // e.g., 47.129.163.27:9100
          service: alert.labels.service, // e.g., node
          description: alert.annotations.description, // e.g., CPU usage description
          startsAt: alert.startsAt, // e.g., 2025-05-25T06:40:29.682Z
          endsAt: alert.endsAt, // e.g., 2025-05-25T06:42:59.682Z
          fingerprint: alert.fingerprint, // e.g., 80e7d055dbb50b48
          isBusinessHours: isBusinessHours, // true if within 9 AM–5 PM UTC
          durationMinutes: durationMinutes // Duration in minutes
        }
      };
    });

     

  3. Add a Switch node for routing (a Code-node equivalent is sketched after this list)
    • Route based on criticality and business hours
    • Create three paths:
      • Critical + After Hours → PagerDuty
      • Critical + Business Hours → Discord (urgent channel)
      • Non-critical → Discord (general alerts channel)
  4. Configure service integrations
    • PagerDuty node:
      • Connect to your PagerDuty service
      • Map alert details to incident fields
      • Set appropriate urgency
    • Discord node:
      • Create formatted messages with alert details
      • Include links to Grafana dashboards
      • Add runbook references if available
  5. Add Notion integration for documentation
    • Create a Notion database node
    • Log all incidents with timestamps, severity, and response details
    • Include resolution status and follow-up tasks
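
If you prefer to spell out the routing conditions from step 3 in code (or want to pre-format the Discord message from step 4), a Code node can stand in for the Switch node. Below is a minimal sketch; it assumes the fields produced by the classification code above, and GRAFANA_URL is a hypothetical placeholder:

    // Routing sketch for a Code node (alternative to the Switch node in step 3).
    // Assumes each item carries the fields produced by the classification code above.
    // GRAFANA_URL is a hypothetical placeholder - point it at your own dashboard.
    const GRAFANA_URL = 'http://your-grafana:3000/d/node-overview';

    return items.map(item => {
      const a = item.json;

      // Same conditions the Switch node would apply
      let route;
      if (a.severity === 'critical' && !a.isBusinessHours) {
        route = 'pagerduty';          // critical + after hours -> page someone
      } else if (a.severity === 'critical') {
        route = 'discord-urgent';     // critical + business hours -> urgent channel
      } else {
        route = 'discord-general';    // everything else -> general alerts channel
      }

      // Pre-formatted message the Discord node can send as-is
      const message = [
        `[${(a.severity || 'unknown').toUpperCase()}] ${a.alertname} (${a.status})`,
        a.description,
        `Instance: ${a.instance} | Duration: ${Math.round(a.durationMinutes || 0)} min`,
        `Dashboard: ${GRAFANA_URL}`
      ].join('\n');

      return { json: { ...a, route, message } };
    });

A downstream IF or Switch node can then branch on the route field, while the PagerDuty and Discord nodes consume message and the mapped alert fields directly.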

When you run the flow, alerts are routed based on the rules you define. Here is my example:

Medium-severity alerts are sent to Discord, where the team can check them later.


Critical alerts outside business hours or on holidays are sent to PagerDuty, where you can configure call or escalation rules on PagerDuty's dashboard.

The alerts arrive in PagerDuty, where you can set up on-call or escalation rules.

Handling incident resolution (basic)

Here is a suggested prompt for the AI agent to evaluate the context for you:

Analyze the following Prometheus alert to determine if it should be auto-resolved by restarting the EC2 instance to handle issues like high CPU usage, especially when the team is unavailable. The context is:

- Alert Name: {{ $node["Code"].json["alertname"] }}
- Severity: {{ $node["Code"].json["severity"] }}
- Duration: {{ $node["Code"].json.durationMinutes }} minutes
- Business Hours: {{ $node["Code"].json["isBusinessHours"] }} (true if 9 AM–5 PM UTC, false otherwise)
- Description: {{ $node["Code"].json["description"] }}

Extract the CPU usage (X%) from the description, formatted as: "On <instance> at <alertname>: CPU usage is X%, Memory available is Y%, Swap usage is Z%, Disk I/O is A s, Network received is B MB/s, Latency is C s".

Decide to auto-resolve (restart the EC2 instance) if:
1. CPU usage > 80% AND outside business hours (isBusinessHours is false).
2. CPU usage > 90% AND duration < 5 minutes.
3. Severity is "critical" AND outside business hours (isBusinessHours is false).

Return only the following JSON object, with no additional text, explanations, or markdown:
{
  "shouldAutoResolve": boolean,
  "reason": "Explanation of the reason why this action should or should not be auto-resolved, referencing CPU usage, duration, severity, and business hours if relevant."
}

- If shouldAutoResolve is true, a Lambda function will be triggered to restart the EC2 instance.
- If shouldAutoResolve is false, no restart will occur.
- Keep the reason concise and clear, referencing the specific criteria met or not met.
- If CPU usage cannot be extracted, assume 0% and include it in the reason.

The agent follows this context and decides whether or not to restart the service.
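
Because LLMs sometimes wrap their answer in markdown fences despite instructions, it helps to parse the agent's reply defensively before the Lambda step. Here is a minimal sketch for a Code node placed between the AI agent and the AWS Lambda node; the agentOutput field name is an assumption, so rename it to match your AI node's actual output field:

    // Defensive parsing of the AI agent's JSON verdict (sketch).
    // Assumption: the agent's raw text reply is exposed as item.json.agentOutput -
    // rename this to match the actual output field of your AI node.
    return items.map(item => {
      const raw = item.json.agentOutput || '';

      // Strip markdown fences or any stray text around the JSON object
      const match = raw.match(/\{[\s\S]*\}/);

      let verdict = { shouldAutoResolve: false, reason: 'Could not parse agent output' };
      if (match) {
        try {
          verdict = JSON.parse(match[0]);
        } catch (e) {
          verdict.reason = `Invalid JSON from agent: ${e.message}`;
        }
      }

      return {
        json: {
          ...item.json,
          shouldAutoResolve: verdict.shouldAutoResolve === true,
          autoResolveReason: verdict.reason
        }
      };
    });

An IF node can then branch on shouldAutoResolve, calling the AWS Lambda node only on the true branch and logging the reason to Notion on both branches.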


 

3. Enhancing the Workflow

You can enhance the workflow to fit your needs. Here are some recommended approaches:

The notification phase

n8n supports various integrations such as Slack, Telegram, and Rocket.Chat, so you can plug in whatever your team uses.

The analyzing phase

You can integrate an AI Agent node, LLM chain, or OpenLLM to evaluate the metrics and then decide where to route the incidents.

You can improve the workflow further, for example:

  • Implement smart silencing: By aggregating the alerts coming from Prometheus, an n8n AI (or Code) node can decide whether to suppress a repeat notification or escalate the problem to the team, as sketched below.
  • External dependency checks: Integrate checks against external services (DNS, Vercel, AWS, …) to confirm an incident is real before acting on it.
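
As a rough illustration of smart silencing, a Code node can use n8n workflow static data to remember recently notified alert fingerprints and drop repeats within a cool-down window. A minimal sketch follows (the 30-minute window is an arbitrary choice, and static data only persists when the workflow runs as Active, not in manual test runs):

    // Smart-silencing sketch: drop repeat notifications for the same alert
    // fingerprint within a cool-down window, using n8n workflow static data.
    const staticData = $getWorkflowStaticData('global');
    staticData.lastNotified = staticData.lastNotified || {};

    const COOLDOWN_MS = 30 * 60 * 1000; // 30 minutes - tune to taste
    const now = Date.now();

    const toNotify = [];
    for (const item of items) {
      const key = item.json.fingerprint;
      const isRepeat = now - (staticData.lastNotified[key] || 0) < COOLDOWN_MS;

      // Always let resolved and critical alerts through; silence other repeats
      if (!isRepeat || item.json.status === 'resolved' || item.json.severity === 'critical') {
        staticData.lastNotified[key] = now;
        toNotify.push(item);
      }
    }

    return toNotify;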

The incident-logging phase

You can store incidents in a destination such as Google Sheets, MongoDB, or an SQL database; many storage backends are supported.
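
As a small sketch of that logging step, a Code node can flatten each classified alert into a single record that a Google Sheets, database, or Notion node can consume directly; the column names below are just an example, and AutoResolved assumes the parsing sketch from the resolution section ran upstream:

    // Flatten each classified alert into one record for Google Sheets / SQL / Notion (sketch).
    return items.map(item => {
      const a = item.json;
      return {
        json: {
          Timestamp: new Date().toISOString(),
          Alert: a.alertname,
          Severity: a.severity,
          Instance: a.instance,
          Status: a.status,
          DurationMinutes: Math.round(a.durationMinutes || 0),
          BusinessHours: a.isBusinessHours ? 'yes' : 'no',
          AutoResolved: a.shouldAutoResolve ? 'yes' : 'no', // set by the parsing sketch, if used
          Description: a.description
        }
      };
    });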

4. Sample code and workflow

Beyond Basic Monitoring: Consider Bubobot

While Prometheus alert management combined with n8n provides powerful capabilities, organizations with critical uptime requirements may need a more comprehensive solution. Bubobot offers advanced monitoring with the shortest check intervals on the market and AI-enhanced anomaly detection that can identify issues before they trip traditional thresholds.