Monitoring workflows

Monitoring → Workflows is where you wire RMM-side automation: when a check fails, an agent drops off, an event log entry matches a pattern, a cron schedule fires, or an external system POSTs to your tenant — do something. Same shape as PSA Workflows, different event source.

This is the engine that used to be called "Triggers" before the May 2026 IA reorg. Old /triggers URLs redirect into here.

Anatomy of a workflow

A workflow has three parts:

Trigger type — which event source fires it. Five choices.
Condition — a JSON shape that filters which events of that type actually fire the workflow.
Actions — one or more things to do when the workflow fires.

Plus three guards:

Scope — leave global, or pin to a specific agent / site / client.
Cooldown — minimum gap between fires per workflow. Defaults to 5 minutes.
Enabled — toggle without deleting.
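
To make the three parts and three guards concrete, here is a minimal sketch of a workflow as a data structure. The class and its field names (`trigger_type`, `cooldown_seconds`, and so on) are illustrative assumptions, not the stored schema or the API shape.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Workflow:
    """Illustrative model only; field names are assumptions, not the real schema."""
    name: str
    trigger_type: str                       # check_failure, agent_status, eventlog, schedule, webhook_inbound
    condition: dict[str, Any]               # flat JSON filter for that trigger type
    actions: list[dict[str, Any]]           # run in order when the workflow fires
    scope: Optional[dict[str, str]] = None  # None means global
    cooldown_seconds: int = 300             # the documented 5-minute default
    enabled: bool = True


memory_cleanup = Workflow(
    name="Memory full, run cleanup",
    trigger_type="check_failure",
    condition={"check_type": "memory", "min_severity": "warning", "consecutive_failures": 2},
    actions=[{"type": "run_script", "script_id": "your-script-id"}],  # placeholder id
    cooldown_seconds=3600,
)
```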

Triggers

`check_failure`

Fires whenever a monitoring check reports fail, error, warn, or warning. The condition can filter by:

check_type — cpu, memory, disk, service, ping, http, script, eventlog, or * for all
min_severity — error only fires on hard fails; warning includes both
consecutive_failures — only fire after N back-to-back failures, so a single transient spike doesn't trigger a script run
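
The three filters combine as a logical AND. The sketch below shows one plausible way to evaluate them; the function name, the defaults for omitted keys, and the mapping of `fail`/`error` to "hard" and `warn`/`warning` to "soft" are assumptions made for illustration.

```python
# Assumed ranking: warn/warning are soft failures, fail/error are hard failures.
SEVERITY_RANK = {"warn": 1, "warning": 1, "fail": 2, "error": 2}


def check_failure_matches(condition, check_type, status, consecutive_failures):
    """Return True if a check result passes every filter in the condition."""
    wanted_type = condition.get("check_type", "*")
    if wanted_type != "*" and wanted_type != check_type:
        return False

    rank = SEVERITY_RANK.get(status)
    if rank is None:  # passing or unknown statuses never fire
        return False

    min_rank = SEVERITY_RANK[condition.get("min_severity", "warning")]
    if rank < min_rank:
        return False

    # Require N back-to-back failures so a single transient spike is ignored.
    return consecutive_failures >= condition.get("consecutive_failures", 1)


cond = {"check_type": "memory", "min_severity": "warning", "consecutive_failures": 2}
first_spike = check_failure_matches(cond, "memory", "warning", 1)
second_failure = check_failure_matches(cond, "memory", "warning", 2)
```

With this condition the first failure is filtered out and the second consecutive failure fires.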

`agent_status`

Fires when an agent transitions online → offline or vice versa. Condition supports:

status — offline, online, or both
after_minutes — only fire if the agent has been in that state for at least N minutes (debounces flapping agents on a wobbly link)
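
The `after_minutes` debounce can be pictured as a comparison between how long the agent has held its current state and the configured threshold. This is a sketch under assumed names (`agent_status_matches`, `since`, `now`), not the engine's actual code.

```python
from datetime import datetime, timedelta, timezone


def agent_status_matches(condition, status, since, now):
    """status: 'online' or 'offline'; since: when the agent entered that state."""
    wanted = condition.get("status", "both")
    if wanted != "both" and wanted != status:
        return False
    # Only fire once the state has been held for at least after_minutes.
    return now - since >= timedelta(minutes=condition.get("after_minutes", 0))


cond = {"status": "offline", "after_minutes": 10}
went_offline = datetime(2026, 5, 4, 9, 0, tzinfo=timezone.utc)
after_3_min = agent_status_matches(cond, "offline", went_offline, went_offline + timedelta(minutes=3))
after_10_min = agent_status_matches(cond, "offline", went_offline, went_offline + timedelta(minutes=10))
```

An agent that drops for three minutes and comes back never reaches the threshold, so the workflow stays quiet.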

`eventlog`

Fires when a Windows event log entry matches a pattern. Condition supports:

log_name — e.g. System, Application, Security
source — the event source (e.g. disk, Service Control Manager)
event_id — exact ID or list of IDs
level — Critical, Error, Warning, Information
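
As an illustration of how these four fields might be applied, the sketch below treats an omitted field as "match anything" and accepts either a single event ID or a list. Exact, case-sensitive comparison and the sample event IDs are assumptions.

```python
def eventlog_matches(condition, entry):
    """Return True if an event log entry satisfies every field in the condition."""
    for key in ("log_name", "source", "level"):
        wanted = condition.get(key)
        if wanted is not None and entry.get(key) != wanted:
            return False

    wanted_ids = condition.get("event_id")
    if wanted_ids is not None:
        if isinstance(wanted_ids, int):  # a single exact ID
            wanted_ids = [wanted_ids]
        if entry.get("event_id") not in wanted_ids:
            return False
    return True


# Illustrative values only.
cond = {"log_name": "System", "source": "disk", "event_id": [7, 11]}
entry = {"log_name": "System", "source": "disk", "event_id": 11, "level": "Error"}
matched = eventlog_matches(cond, entry)
```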

`schedule`

Cron-style. Fires on a recurring schedule. Condition supports a standard cron expression. Use this for periodic housekeeping that isn't reactive to monitored events.
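
For readers less familiar with the five cron fields (minute, hour, day of month, month, day of week), here is a deliberately small matcher. It handles only `*` and single integers, not ranges, lists, or steps, and it is a teaching sketch rather than the scheduler OpsMerge runs.

```python
from datetime import datetime


def cron_matches(expr, when):
    """Check a five-field cron expression (only '*' and integers) against a datetime."""
    minute, hour, dom, month, dow = expr.split()

    def field_ok(field, value):
        return field == "*" or int(field) == value

    # Cron counts Sunday as 0 (7 is also accepted); Python's weekday() has Monday as 0.
    cron_dow = (when.weekday() + 1) % 7
    dom_ok = field_ok(dom, when.day)
    dow_ok = dow == "*" or int(dow) % 7 == cron_dow

    # Standard cron matches on either day field when both are restricted.
    if dom != "*" and dow != "*":
        day_ok = dom_ok or dow_ok
    else:
        day_ok = dom_ok and dow_ok

    return (field_ok(minute, when.minute) and field_ok(hour, when.hour)
            and field_ok(month, when.month) and day_ok)


daily_7am = cron_matches("0 7 * * *", datetime(2026, 5, 4, 7, 0))
one_minute_late = cron_matches("0 7 * * *", datetime(2026, 5, 4, 7, 1))
```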

`webhook_inbound`

Exposes a per-tenant inbound URL. Any external system that POSTs to it (with the secret in the path) fires the workflow. Useful for stitching third-party alerting into OpsMerge.
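
From the sending side, firing the workflow is an ordinary JSON POST. The sketch below only builds the request; the base URL, the exact path layout, and the payload fields are placeholders, so copy the real inbound URL from your tenant rather than constructing it by hand.

```python
import json
import urllib.request


def build_inbound_request(base_url, secret, payload):
    """Build (but do not send) a POST to an inbound webhook URL with the secret in the path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{secret}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and secret, for illustration only.
req = build_inbound_request(
    "https://opsmerge.example.invalid/hooks/inbound/",
    "YOUR-SECRET",
    {"source": "third-party-alerting", "message": "Upstream latency high"},
)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Because the secret is part of the URL, treat the whole URL as a credential and keep it out of logs and screenshots.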

Actions

Five kinds today:

`run_script`

Sends a script to the affected agent via NATS request/reply. The script runs as SYSTEM (Windows) or root (Linux/macOS). Action params:

script_id — pick from your script library
args — optional, comma-separated
timeout_seconds — script execution cap

`create_alert`

Inserts an alert row in the alerts table — same surface as failing-check alerts. Action params:

severity — info / warning / error
message — text, with simple template variables like {agent_name} and {check_name}

Useful when you want the workflow to surface in Monitoring → Alerts for triage, without coupling to script execution.
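
Template variables behave like simple placeholder substitution. The sketch below shows the idea with Python's `str.format_map`; leaving unknown variables untouched is an assumption, since the engine's handling of missing values is not specified here.

```python
class _KeepUnknown(dict):
    """Leave placeholders with no value in the text instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_template(template, context):
    return template.format_map(_KeepUnknown(context))


message = render_template(
    "{check_name} is failing on {agent_name}",
    {"check_name": "memory", "agent_name": "web-01"},
)
```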

`fire_webhook`

POSTs a JSON payload to a URL you supply. Action params:

url — must be public (SSRF protection blocks RFC 1918 / loopback / link-local)
payload_template — optional JSON body; falls back to a default {trigger_id, agent_id, condition_data} shape
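
To check in advance whether a target address would fall into one of the blocked ranges, Python's standard `ipaddress` module can classify it. This is an approximation of the rule described above, not the product's own check; a real SSRF guard also has to resolve the hostname and test every address it resolves to.

```python
import ipaddress


def is_blocked_address(ip_text):
    """True for private (incl. RFC 1918), loopback, or link-local addresses."""
    ip = ipaddress.ip_address(ip_text)
    return ip.is_private or ip.is_loopback or ip.is_link_local


internal = is_blocked_address("192.168.1.20")
public = is_blocked_address("8.8.8.8")
```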

`create_ticket`

Creates a PSA ticket from the workflow context. Action params:

title_template — supports {agent_name}, {check_name}, {message}
priority — low / medium / high / urgent
assigned_team_id — optional team to land on
category_id — optional ticket category

One alert incident maps to exactly one ticket. For check_failure and agent_status workflows the ticket is keyed to the underlying alert, so repeat fires while the alert is still active neither create a second ticket nor modify the existing one, however often the workflow re-fires or the message text changes. The alert row records the linked ticket, and when the alert clears (check recovers, agent comes back online, or an operator resolves it) the ticket is auto-resolved with an internal note. If the alert flaps and re-fires shortly after clearing, the same ticket is reopened rather than a duplicate raised; a ticket you close by hand stays closed, and a genuinely new incident gets a fresh ticket.

Workflow types with no alert behind them (eventlog, schedule, webhook_inbound) dedupe on (trigger_id, condition_data, day) instead, enforced by a unique partial index on tickets(org_id, source_type, source_id).
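
One way to picture the (trigger_id, condition_data, day) key is as a stable string that the unique index can reject duplicates on. The hashing scheme and function name below are assumptions for illustration; only the three inputs come from the description above.

```python
import hashlib
import json
from datetime import date


def dedupe_source_id(trigger_id, condition_data, day):
    """Derive a stable key: same trigger, same condition data, same day gives the same id."""
    # Sort keys so that dict ordering never changes the result.
    canonical = json.dumps(condition_data, sort_keys=True, separators=(",", ":"))
    raw = f"{trigger_id}|{canonical}|{day.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


key_a = dedupe_source_id("trg_123", {"log_name": "System", "level": "Critical"}, date(2026, 5, 4))
key_b = dedupe_source_id("trg_123", {"level": "Critical", "log_name": "System"}, date(2026, 5, 4))
key_next_day = dedupe_source_id("trg_123", {"log_name": "System", "level": "Critical"}, date(2026, 5, 5))
```

The first two keys are identical, so a second insert on the same day would hit the unique index; the next day's key differs and allows a new ticket.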

`send_notification`

Shows an interactive on-screen message card to the person at the affected agent, the same card the ad-hoc Send Message action raises from the Agents page. Delivered to the tray running in the user session, so it only reaches machines with a logged-in user and the tray installed (headless servers ignore it). Action params:

title / body — the card heading and message. Both support {agent_name}, {check_name}, {message}.
urgency — low / normal / high (controls the card accent).
timeout_seconds — seconds on screen before it auto-dismisses; 0 uses the tray default.
sticky — keep the card up until the user acts, ignoring the timeout.
actions — optional buttons, each { id, label, type }. Types: dismiss, acknowledge, open_url (needs url), snooze (optional snooze_minutes), and reboot_now. A reboot_now button runs the agent's normal reboot path when clicked, and is only offered to authors who hold the reboot permission.

The message is delivered per affected agent and recorded in that agent's message history.
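
The button rules above (required fields, `open_url` needing a `url`, `reboot_now` gated on permission) can be expressed as a small validator. This is a sketch of the rules as described, with a hypothetical function name and error wording, not the editor's real validation.

```python
ALLOWED_TYPES = {"dismiss", "acknowledge", "open_url", "snooze", "reboot_now"}


def validate_buttons(buttons, author_can_reboot):
    """Return a list of problems; an empty list means the buttons look valid."""
    problems = []
    for button in buttons:
        label = button.get("id", "<no id>")
        kind = button.get("type")
        if not all(button.get(k) for k in ("id", "label", "type")):
            problems.append(f"{label}: id, label and type are all required")
        elif kind not in ALLOWED_TYPES:
            problems.append(f"{label}: unknown type {kind!r}")
        elif kind == "open_url" and not button.get("url"):
            problems.append(f"{label}: open_url needs a url")
        elif kind == "reboot_now" and not author_can_reboot:
            problems.append(f"{label}: author lacks the reboot permission")
    return problems


buttons = [
    {"id": "reboot", "label": "Reboot now", "type": "reboot_now"},
    {"id": "snooze", "label": "Snooze", "type": "snooze", "snooze_minutes": 60},
]
ok_for_privileged_author = validate_buttons(buttons, author_can_reboot=True)
```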

Worked examples

"Memory full → run cleanup script"

Doug's original ask. New workflow:

Type: check_failure
Condition: check_type = memory, min_severity = warning, consecutive_failures = 2
Action: run_script, script_id = your memory cleanup script
Cooldown: 3600 seconds (don't re-run within an hour, even if it stays red)
Enabled: on

The seed workflow ships with create_alert rather than run_script so it works out of the box. Open [Sample] Memory exhaustion, change the action type to run_script, pick your script, save, enable.

"Disk nearly full → run cleanup script"

Same shape, check_type = disk. The seed [Sample] Disk space low is the starting point.

"Agent offline > 10 minutes → create urgent ticket"

Type: agent_status
Condition: status = offline, after_minutes = 10
Action: create_ticket, title_template = Agent {agent_name} offline > 10 minutes, priority = urgent
Cooldown: 0 (state change won't repeat)

The ticket is keyed to the availability alert, so one offline agent gets one ticket for the whole outage, and the ticket auto-resolves when the agent comes back online.

"Critical Windows event → email and ticket"

Type: eventlog
Condition: log_name = System, level = Critical
Actions: create_ticket (priority high) AND fire_webhook to your incident channel

Multiple actions on one workflow execute in order. If one fails the rest still run.
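
The "in order, and a failure does not stop the rest" behaviour looks roughly like the loop below. The action representation (a dict with a `run` callable) and the result shape are assumptions made for the sketch.

```python
def run_actions(actions, context):
    """Run each action in order; record failures but keep going."""
    results = []
    for action in actions:
        try:
            output = action["run"](context)
            results.append({"type": action["type"], "ok": True, "output": output})
        except Exception as exc:  # one failing action must not block the others
            results.append({"type": action["type"], "ok": False, "error": str(exc)})
    return results


def failing_ticket(context):
    raise RuntimeError("ticket API down")


def post_to_channel(context):
    return f"posted for {context['agent_name']}"


results = run_actions(
    [{"type": "create_ticket", "run": failing_ticket},
     {"type": "fire_webhook", "run": post_to_channel}],
    {"agent_name": "dc-01"},
)
```

Here the ticket action fails, yet the webhook still runs, and both outcomes are recorded separately.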

"Daily disk usage report"

Type: schedule
Condition: cron = 0 7 * * * (7am every day)
Action: run_script, script_id = your reporting script
Scope: pinned to your reporting-server agent so it doesn't fan out to every endpoint

"Reboot needed → prompt the user at 4pm"

Type: schedule
Condition: cron = 0 16 * * * (4pm every day)
Scope: a smart asset group of machines flagged needs-reboot, so the prompt only reaches those endpoints
Action: send_notification, title = Reboot required, body = Your PC needs a reboot to finish updates., urgency = high, sticky = on, with two buttons: Reboot now (reboot_now) and Snooze (snooze, snooze_minutes = 60)

The card sits on screen until the user clicks. Reboot now runs the same reboot the bulk action uses; Snooze re-shows it an hour later. No new scheduling code, just a schedule trigger plus the group.

Sample workflows shipped with new tenants

New OpsMerge tenants get three starter examples in the editor, all disabled:

Sample	Trigger	Action
`[Sample] Disk space low`	`check_failure` / `check_type=disk` / first failure	`create_alert` (warning)
`[Sample] Memory exhaustion`	`check_failure` / `check_type=memory` / 2 consecutive	`create_alert` (warning)
`[Sample] CPU sustained high`	`check_failure` / `check_type=cpu` / 3 consecutive	`create_alert` (warning)

Each is a starting point. Common edits:

Swap create_alert for run_script and pick a remediation script you've written.
Lower or remove the cooldown if you want noisier reporting; raise it if a slow-running cleanup script needs more time between fires.
Pin the scope (agent / site / client) if it should only apply to one client.

Workflow scope

Each workflow can target:

Global (no scope) — every agent in the tenant
A specific client — only fires for agents under that client
A specific site — only fires for agents at that site
A specific agent — only fires for that one box

Useful for client-specific behaviour without duplicating workflows.
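
Scope checks reduce to comparing one identifier on the agent against the workflow's pin. The sketch assumes a scope shaped like `{"kind": ..., "id": ...}` and an agent record carrying `agent_id`, `site_id`, and `client_id`; both shapes are hypothetical.

```python
def scope_matches(scope, agent):
    """scope is None for global, else {'kind': 'client'|'site'|'agent', 'id': ...}."""
    if scope is None:
        return True  # global workflows apply to every agent in the tenant
    return agent.get(f"{scope['kind']}_id") == scope["id"]


agent = {"agent_id": "a1", "site_id": "s1", "client_id": "c1"}
applies_globally = scope_matches(None, agent)
applies_to_client = scope_matches({"kind": "client", "id": "c1"}, agent)
applies_to_other_site = scope_matches({"kind": "site", "id": "s9"}, agent)
```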

Cooldown and de-duplication

Cooldown — minimum gap between fires of one workflow. A 5-minute default protects against runaway loops; bump to 3600 for "once per hour" patterns.
Workflow history — every fire writes to trigger_history with the condition data and per-action result. Read it via the workflow's history drawer.
Ticket dedupe — create_ticket keys the ticket to the alert incident behind the fire (check and agent-status workflows), so a failing check raises one ticket for the life of the alert, not one per fire, and the ticket auto-resolves when the alert clears. Non-alert workflow types (event log, schedule, inbound webhook) fall back to a stable source_id derived from (trigger_id, condition_data).
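
The cooldown guard amounts to a timestamp comparison made before any action runs. A minimal sketch, assuming the engine tracks the last fire time per workflow and that cooldown values are in seconds (consistent with 3600 meaning one hour):

```python
from datetime import datetime, timedelta, timezone


def cooldown_allows(last_fired_at, now, cooldown_seconds=300):
    """Return True if enough time has passed since the last fire (or it never fired)."""
    if last_fired_at is None:
        return True
    return (now - last_fired_at).total_seconds() >= cooldown_seconds


last_fire = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)
too_soon = cooldown_allows(last_fire, last_fire + timedelta(minutes=4))
after_default = cooldown_allows(last_fire, last_fire + timedelta(minutes=5))
hourly = cooldown_allows(last_fire, last_fire + timedelta(minutes=30), cooldown_seconds=3600)
```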

Versus the PSA workflows engine

There are two engines because they fire on different things:

	Monitoring workflows	PSA workflows
Path	`/monitoring/workflows`	`/psa/workflows/ticket-rules`
Fires on	RMM events (check fail, agent state, event log, schedule, inbound webhook)	PSA ticket events (created, updated, commented, SLA warn, scheduled stale, customer silence)
Condition shape	flat JSON per trigger type	tree of AND/OR groups, per-field operators
Actions	`run_script`, `create_alert`, `fire_webhook`, `create_ticket`, `send_notification`	thirteen ticket-side actions plus `send_email` and `send_webhook`
Audit	`trigger_history` table	`ticket_rule_runs` table

A typical end-to-end flow uses both: monitoring workflow detects disk full → create_ticket → PSA workflow ticket.created fires with category = infrastructure → emails the on-call engineer.

Common patterns

"Run different script per check type"

One workflow per check type. Use the check_type condition to filter. Three workflows = three scripts.

"Don't run a script if a recent run already ran"

Set the cooldown high enough. Or chain: workflow A creates a ticket; workflow B, on ticket.created with category = X, runs the script. The PSA rules engine has richer guards (cooldown + max-fires) than the monitoring one.

"Test a workflow without waiting for a real failure"

The workflow list page has a "Test fire" button on each row. It synthesises a fake condition_data and runs the action chain so you can see whether your script execution / webhook delivery / ticket creation actually works.

Common issues

"Workflow fires but script doesn't run." Check trigger_history. If the action result shows NATS not available, the agent was offline at the time of the fire. The run_script action is fire-and-forget without persistence, so the script is not queued for when the agent returns. Re-fire by re-tripping the condition, or pre-empt it with an agent_status workflow.

"Webhook POSTs but receiving system 404s." SSRF protection blocks private, loopback, and link-local addresses. The webhook URL must resolve to a public address.

"create_ticket created the ticket but it's not in my expected queue." No assigned_team_id in the action means the ticket lands in the default queue. Add it to the action params.

"Workflow shows enabled=true but never fires." Three causes in order of likelihood:

Cooldown is still active from a recent fire.
The condition filter doesn't match — check_type = "disk" exactly, not "diskspace" or "disk_space". Canonical types: cpu, memory, disk, service, ping, http, script, eventlog.
The workflow is scoped to a specific agent/site/client and the failing agent doesn't match.

Migration from `/triggers`

Old route paths still resolve via redirect. Specifically:

/triggers → /monitoring/workflows
/triggers/new → /monitoring/workflows/new
/triggers/:id → /monitoring/workflows/:id

Existing workflows you authored before the reorg keep working — the underlying triggers table is unchanged. Only the UI route and nav label moved.

Next

Monitoring & alerts — what gets checked and how alerts flow
Scripts — the script library that run_script actions invoke
PSA Workflows — the ticket-side engine that pairs with this one
