Automated Incident Response Workflows with n8n and Monitoring Tools


May 25, 2025

1. Incident Response Process

Effective incident management follows a structured approach that automation can enhance at every stage:

Preparation

  • Documentation: Maintain runbooks and response procedures
  • Monitoring setup: Configure Prometheus alerts with appropriate thresholds
  • Team readiness: Ensure on-call schedules and escalation paths are clear

Detection & Analysis

  • Alert triggering: Prometheus detects anomalies based on predefined rules
  • Initial assessment: Automated classification of severity and impact
  • Contextual enrichment: Adding relevant system information to alerts

Recovery

  • Notification routing: Directing alerts to appropriate channels (PagerDuty, Slack)
  • Guided response: Linking to relevant runbooks or documentation
  • Automated remediation: Triggering recovery scripts for known issues

Post-incident Activity

  • Documentation: Capturing timeline and response details
  • Analysis: Identifying root causes and prevention opportunities
  • Process improvement: Updating monitoring rules and response workflows

In the steps below, we will build an automation flow that covers Detection & Analysis and routes alerts based on their context.

2. Building an n8n Incident Response Workflow

Let's create a practical workflow that handles Prometheus alerts intelligently, ensuring critical issues are delivered while taking severity, business hours, and other context into account.

This workflow demonstrates a basic flow for handling incidents before your team is available to respond.

Workflow Overview

Our goal is to build a system that:

  • Receives alerts from Prometheus/AlertManager
  • Analyzes severity and business hours
  • Routes critical after-hours alerts to PagerDuty for immediate response
  • Sends less urgent or business-hours alerts to Slack/Discord
  • Automates incident resolution based on AI-agent suggestions
  • Triggers a Lambda function to execute your operations (you can customize it for ECS, EKS, ...)
  • Documents all incidents in a structured format (Notion)

This approach ensures your team isn't woken up unnecessarily while maintaining rapid response for genuine emergencies.

Setting Up the Monitoring Stack

Before building our n8n workflow, we need a properly configured monitoring system:

  1. Install the Prometheus stack

    • Prometheus server for metrics collection
    • Exporters for your specific services
    • AlertManager for alert handling
    • Grafana for visualization

    To configure the stack easily, I’ve made a click-and-run script to deploy the “Node Exporter - Prometheus - Alert Manager - Grafana Stack” here. You can take a look, then pull and deploy it yourself.

  2. Configure alert rules in Prometheus


    Below is a sample rule file that triggers the alerts:

    groups:
    - name: example
      rules:
      - alert: HighCPU
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle",job="exporters"}[5m])) * 100) > 10
        for: 1m
        labels:
          severity: critical
          service: node
        annotations:
          description: "CPU usage on {{ $labels.instance }} is {{ $value }}%"
      - alert: HighLatency
        expr: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) > 0.5
        for: 1m
        labels:
          severity: warning
          service: web
        annotations:
          description: "Response time on {{ $labels.instance }} is {{ $value }}s"
  3. Set up AlertManager to forward to n8n

    route:
      receiver: 'n8n-webhook'
      group_by: ['alertname', 'instance']
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 30m
    
    receivers:
    - name: 'n8n-webhook'
      webhook_configs:
      - url: 'http://your-n8n-instance:5678/webhook/prometheus'
        send_resolved: true
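
Once AlertManager is active, it will POST a JSON payload to the n8n webhook. Below is a trimmed example of that payload, based on Alertmanager's standard webhook format (exact fields can vary slightly between versions, and the instance value here is just a placeholder):

    {
      "receiver": "n8n-webhook",
      "status": "firing",
      "alerts": [
        {
          "status": "firing",
          "labels": {
            "alertname": "HighCPU",
            "instance": "your-host:9100",
            "severity": "critical",
            "service": "node"
          },
          "annotations": {
            "description": "CPU usage on your-host:9100 is 93.5%"
          },
          "startsAt": "2025-05-25T06:40:29.682Z",
          "endsAt": "0001-01-01T00:00:00Z",
          "fingerprint": "80e7d055dbb50b48"
        }
      ],
      "groupLabels": { "alertname": "HighCPU", "instance": "your-host:9100" },
      "commonLabels": { "alertname": "HighCPU", "severity": "critical" },
      "externalURL": "http://your-alertmanager:9093"
    }

The Code node in the next section reads this structure from items[0].json.body.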

Building the n8n Workflow

Now let's create our n8n workflow to process these alerts intelligently:

  1. Create a Webhook node
    • This will receive alerts from AlertManager
    • Configure it to parse JSON payloads

      Note: To use "Production URL", you have to set the workflow to "Active"

  2. Add a Code node for alert classification

    To analyze the payload sent by Prometheus, we will parse the JSON. We will also compute whether the alert falls within business hours and how long the incident has lasted, which helps us evaluate the incident better.

    Add the Code node and enter the following JavaScript:

    // Parse the AlertManager payload forwarded by the Webhook node
    const alerts = items[0].json.body.alerts || [];

    return alerts.map(alert => {
      const startsAt = new Date(alert.startsAt);
      // Note: for alerts that are still firing, endsAt may be unset (zero time)
      const endsAt = new Date(alert.endsAt);
      const hour = endsAt.getUTCHours();
      const isBusinessHours = hour >= 9 && hour < 17; // 9 AM–5 PM UTC
      const durationMinutes = (endsAt - startsAt) / 1000 / 60; // Duration in minutes

      return {
        json: {
          status: alert.status, // firing or resolved
          alertname: alert.labels.alertname, // e.g., HighCPU
          severity: alert.labels.severity, // e.g., critical
          instance: alert.labels.instance, // e.g., 47.129.163.27:9100
          service: alert.labels.service, // e.g., node
          description: alert.annotations.description, // e.g., CPU usage description
          startsAt: alert.startsAt, // e.g., 2025-05-25T06:40:29.682Z
          endsAt: alert.endsAt, // e.g., 2025-05-25T06:42:59.682Z
          fingerprint: alert.fingerprint, // e.g., 80e7d055dbb50b48
          isBusinessHours: isBusinessHours, // true if within 9 AM–5 PM UTC
          durationMinutes: durationMinutes // Duration in minutes
        }
      };
    });

     

  3. Add a Switch node for routing (a Code-node equivalent is sketched after this list)
    • Route based on criticality and business hours
    • Create three paths:
      • Critical + After Hours → PagerDuty
      • Critical + Business Hours → Discord (urgent channel)
      • Non-critical → Discord (general alerts channel)
  4. Configure service integrations
    • PagerDuty node:
      • Connect to your PagerDuty service
      • Map alert details to incident fields
      • Set appropriate urgency
    • Discord node:
      • Create formatted messages with alert details
      • Include links to Grafana dashboards
      • Add runbook references if available
  5. Add Notion integration for documentation
    • Create a Notion database node
    • Log all incidents with timestamps, severity, and response details
    • Include resolution status and follow-up tasks
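
If you prefer to spell out the routing conditions from step 3 in code (or want to pre-format the Discord message from step 4), a Code node can stand in for the Switch node. Below is a minimal sketch; it assumes the fields produced by the classification code above, and GRAFANA_URL is a hypothetical placeholder:

    // Routing sketch for a Code node (alternative to the Switch node in step 3).
    // Assumes each item carries the fields produced by the classification code above.
    // GRAFANA_URL is a hypothetical placeholder - point it at your own dashboard.
    const GRAFANA_URL = 'http://your-grafana:3000/d/node-overview';

    return items.map(item => {
      const a = item.json;

      // Same conditions the Switch node would apply
      let route;
      if (a.severity === 'critical' && !a.isBusinessHours) {
        route = 'pagerduty';          // critical + after hours -> page someone
      } else if (a.severity === 'critical') {
        route = 'discord-urgent';     // critical + business hours -> urgent channel
      } else {
        route = 'discord-general';    // everything else -> general alerts channel
      }

      // Pre-formatted message the Discord node can send as-is
      const message = [
        `[${(a.severity || 'unknown').toUpperCase()}] ${a.alertname} (${a.status})`,
        a.description,
        `Instance: ${a.instance} | Duration: ${Math.round(a.durationMinutes || 0)} min`,
        `Dashboard: ${GRAFANA_URL}`
      ].join('\n');

      return { json: { ...a, route, message } };
    });

A downstream IF or Switch node can then branch on the route field, while the PagerDuty and Discord nodes consume message and the mapped alert fields directly.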

When you run the flow, alerts are routed based on the rules you define. Here is my example:

Medium-severity alerts are sent to Discord, where the team can check them later.


Critical alerts outside business hours or on holidays are sent to PagerDuty, where you can configure call or escalation rules on PagerDuty's dashboard.

The alerts arrive in PagerDuty, where you can set up on-call or escalation rules.

Handling incident resolution (basic)

Here is a suggested prompt for the AI agent to evaluate the context for you:

Analyze the following Prometheus alert to determine if it should be auto-resolved by restarting the EC2 instance to handle issues like high CPU usage, especially when the team is unavailable. The context is:

- Alert Name: {{ $node["Code"].json["alertname"] }}
- Severity: {{ $node["Code"].json["severity"] }}
- Duration: {{ $node["Code"].json.durationMinutes }} minutes
- Business Hours: {{ $node["Code"].json["isBusinessHours"] }} (true if 9 AM–5 PM UTC, false otherwise)
- Description: {{ $node["Code"].json["description"] }}

Extract the CPU usage (X%) from the description, formatted as: "On <instance> at <alertname>: CPU usage is X%, Memory available is Y%, Swap usage is Z%, Disk I/O is A s, Network received is B MB/s, Latency is C s".

Decide to auto-resolve (restart the EC2 instance) if:
1. CPU usage > 80% AND outside business hours (isBusinessHours is false).
2. CPU usage > 90% AND duration < 5 minutes.
3. Severity is "critical" AND outside business hours (isBusinessHours is false).

Return only the following JSON object, with no additional text, explanations, or markdown:
{
  "shouldAutoResolve": boolean,
  "reason": "Explanation of the reason why this action should or should not be auto-resolved, referencing CPU usage, duration, severity, and business hours if relevant."
}

- If shouldAutoResolve is true, a Lambda function will be triggered to restart the EC2 instance.
- If shouldAutoResolve is false, no restart will occur.
- Keep the reason concise and clear, referencing the specific criteria met or not met.
- If CPU usage cannot be extracted, assume 0% and include it in the reason.

The agent follows this context and decides whether or not to restart the service.
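
Because LLMs sometimes wrap their answer in markdown fences despite instructions, it helps to parse the agent's reply defensively before the Lambda step. Here is a minimal sketch for a Code node placed between the AI agent and the AWS Lambda node; the agentOutput field name is an assumption, so rename it to match your AI node's actual output field:

    // Defensive parsing of the AI agent's JSON verdict (sketch).
    // Assumption: the agent's raw text reply is exposed as item.json.agentOutput -
    // rename this to match the actual output field of your AI node.
    return items.map(item => {
      const raw = item.json.agentOutput || '';

      // Strip markdown fences or any stray text around the JSON object
      const match = raw.match(/\{[\s\S]*\}/);

      let verdict = { shouldAutoResolve: false, reason: 'Could not parse agent output' };
      if (match) {
        try {
          verdict = JSON.parse(match[0]);
        } catch (e) {
          verdict.reason = `Invalid JSON from agent: ${e.message}`;
        }
      }

      return {
        json: {
          ...item.json,
          shouldAutoResolve: verdict.shouldAutoResolve === true,
          autoResolveReason: verdict.reason
        }
      };
    });

An IF node can then branch on shouldAutoResolve, calling the AWS Lambda node only on the true branch and logging the reason to Notion on both branches.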


 

3. Enhancing the Workflow

You can enhance the workflow to fit your needs. Here are some recommended approaches:

The notification phase

n8n supports various integrations such as Slack, Telegram, and Rocket.Chat, so you can plug in whatever your team uses.

The analyzing phase

You can integrate an AI Agent node, LLM chain, or OpenLLM to evaluate the metrics and then decide where to route the incidents.

You can improve the workflow further, for example:

  • Implement smart silencing: By aggregating the alerts coming from Prometheus, an n8n AI (or Code) node can decide whether to suppress a repeat notification or escalate the problem to the team, as sketched below.
  • External dependency checks: Integrate checks against external services (DNS, Vercel, AWS, …) to confirm an incident is real before acting on it.
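
As a rough illustration of smart silencing, a Code node can use n8n workflow static data to remember recently notified alert fingerprints and drop repeats within a cool-down window. A minimal sketch follows (the 30-minute window is an arbitrary choice, and static data only persists when the workflow runs as Active, not in manual test runs):

    // Smart-silencing sketch: drop repeat notifications for the same alert
    // fingerprint within a cool-down window, using n8n workflow static data.
    const staticData = $getWorkflowStaticData('global');
    staticData.lastNotified = staticData.lastNotified || {};

    const COOLDOWN_MS = 30 * 60 * 1000; // 30 minutes - tune to taste
    const now = Date.now();

    const toNotify = [];
    for (const item of items) {
      const key = item.json.fingerprint;
      const isRepeat = now - (staticData.lastNotified[key] || 0) < COOLDOWN_MS;

      // Always let resolved and critical alerts through; silence other repeats
      if (!isRepeat || item.json.status === 'resolved' || item.json.severity === 'critical') {
        staticData.lastNotified[key] = now;
        toNotify.push(item);
      }
    }

    return toNotify;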

The incident-logging phase

You can store incidents in a destination such as Google Sheets, MongoDB, or an SQL database; many storage backends are supported.
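
As a small sketch of that logging step, a Code node can flatten each classified alert into a single record that a Google Sheets, database, or Notion node can consume directly; the column names below are just an example, and AutoResolved assumes the parsing sketch from the resolution section ran upstream:

    // Flatten each classified alert into one record for Google Sheets / SQL / Notion (sketch).
    return items.map(item => {
      const a = item.json;
      return {
        json: {
          Timestamp: new Date().toISOString(),
          Alert: a.alertname,
          Severity: a.severity,
          Instance: a.instance,
          Status: a.status,
          DurationMinutes: Math.round(a.durationMinutes || 0),
          BusinessHours: a.isBusinessHours ? 'yes' : 'no',
          AutoResolved: a.shouldAutoResolve ? 'yes' : 'no', // set by the parsing sketch, if used
          Description: a.description
        }
      };
    });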

4. Sample code and workflow

Beyond Basic Monitoring: Consider Bubobot

While Prometheus alert management combined with n8n provides powerful capabilities, organizations with critical uptime requirements may need a more comprehensive solution. Bubobot offers advanced monitoring with the shortest check intervals on the market and AI-enhanced anomaly detection that can identify issues before they trip traditional thresholds.