Monitoring workflows
Monitoring → Workflows is where you wire RMM-side automation: when a check fails, an agent drops off, an event log entry matches a pattern, a cron schedule fires, or an external system POSTs to your tenant — do something. Same shape as PSA Workflows, different event source.
This is the engine that used to be called "Triggers" before the May 2026 IA reorg. Old /triggers URLs redirect into here.
Anatomy of a workflow
A workflow has three parts:
- Trigger type — which event source fires it. Five choices.
- Condition — a JSON shape that filters which events of that type actually fire the workflow.
- Actions — one or more things to do when the workflow fires.
Plus three guards:
- Scope — leave global, or pin to a specific agent / site / client.
- Cooldown — minimum gap between fires per workflow. Defaults to 5 minutes.
- Enabled — toggle without deleting.
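To make the shape concrete, here is a minimal sketch of a workflow as a Python record. The class and field names are illustrative assumptions, not the stored schema. The cooldown is expressed in seconds, which is how the worked examples on this page write it (3600 for one hour).

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Action:
    type: str  # run_script, create_alert, fire_webhook, or create_ticket
    params: dict[str, Any] = field(default_factory=dict)


@dataclass
class Workflow:
    name: str
    trigger_type: str            # one of the five trigger types
    condition: dict[str, Any]    # flat JSON filter for that trigger type
    actions: list[Action]
    # Guards (field names are hypothetical)
    scope: Optional[dict[str, str]] = None  # None means global
    cooldown_seconds: int = 300             # the 5-minute default
    enabled: bool = True


memory_cleanup = Workflow(
    name="Memory full: run cleanup script",
    trigger_type="check_failure",
    condition={"check_type": "memory", "min_severity": "warning", "consecutive_failures": 2},
    actions=[Action("run_script", {"script_id": "<your-script-id>"})],
    cooldown_seconds=3600,
)
```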
Triggers
check_failure
Fires whenever a monitoring check reports fail, error, warn, or warning. The condition can filter by:
- check_type: cpu, memory, disk, service, ping, http, script, eventlog, or * for all
- min_severity: error fires only on hard fails; warning includes both
- consecutive_failures: only fire after N back-to-back failures, so a single transient spike doesn't trigger a script run
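The three filters combine as a simple AND. A rough sketch of that logic follows, assuming that fail and error count as hard fails, that warn and warning count as soft ones, and that missing keys fall back to the most permissive value. The event dictionary shape is hypothetical.

```python
HARD = {"fail", "error"}
SOFT = {"warn", "warning"}


def matches_check_failure(condition: dict, event: dict) -> bool:
    """Return True if a failing-check event passes the condition filter."""
    wanted_type = condition.get("check_type", "*")
    if wanted_type != "*" and event["check_type"] != wanted_type:
        return False

    # min_severity = error keeps hard fails only; warning keeps both groups.
    min_severity = condition.get("min_severity", "warning")
    allowed = HARD if min_severity == "error" else HARD | SOFT
    if event["status"] not in allowed:
        return False

    # Require N back-to-back failures before firing.
    needed = condition.get("consecutive_failures", 1)
    return event.get("consecutive_failures", 1) >= needed
```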
agent_status
Fires when an agent transitions online → offline or vice versa. Condition supports:
- status: offline, online, or both
- after_minutes: only fire if the agent has been in that state for at least N minutes (debounces flapping agents on a wobbly link)
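The after_minutes debounce amounts to comparing how long the agent has held its current state against a threshold. A minimal sketch, with a hypothetical function name and arguments:

```python
from datetime import datetime, timedelta


def should_fire_agent_status(condition: dict, current_status: str,
                             state_since: datetime, now: datetime) -> bool:
    """Fire only if the agent is in a wanted state and has held it long enough."""
    wanted = condition.get("status", "both")
    if wanted != "both" and current_status != wanted:
        return False
    hold = timedelta(minutes=condition.get("after_minutes", 0))
    return now - state_since >= hold
```

An agent that drops and reconnects within the window resets state_since, so it never reaches the threshold.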
eventlog
Fires when a Windows event log entry matches a pattern. Condition supports:
- log_name: e.g. System, Application, Security
- source: the event source (e.g. disk, Service Control Manager)
- event_id: exact ID or list of IDs
- level: Critical, Error, Warning, Information
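As an illustration of how such a filter could be evaluated, the sketch below treats every key present in the condition as required and accepts event_id as either a single ID or a list. Exact, case-sensitive matching and the entry dictionary shape are assumptions.

```python
def matches_eventlog(condition: dict, entry: dict) -> bool:
    """Return True if an event log entry satisfies every key set in the condition."""
    for key in ("log_name", "source", "level"):
        if key in condition and entry.get(key) != condition[key]:
            return False

    if "event_id" in condition:
        wanted = condition["event_id"]
        ids = wanted if isinstance(wanted, list) else [wanted]
        if entry.get("event_id") not in ids:
            return False

    return True
```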
schedule
Cron-style. Fires on a recurring schedule. Condition supports a standard cron expression. Use this for periodic housekeeping that isn't reactive to monitored events.
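To check what a cron expression will do before saving it, a small matcher is handy. The sketch below handles five-field expressions with *, lists, ranges, and steps only. It is not the scheduler OpsMerge runs: it does not accept names or 7 for Sunday, and it ANDs day-of-month with day-of-week, whereas standard cron ORs them when both are restricted.

```python
from datetime import datetime


def _field_matches(expr: str, value: int, lo: int, hi: int) -> bool:
    """Match one cron field supporting '*', 'a', 'a-b', 'a,b' and '/step'."""
    for part in expr.split(","):
        rng, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            end = hi if step else start
        if start <= value <= end and (value - start) % step_n == 0:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (when.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        _field_matches(minute, when.minute, 0, 59)
        and _field_matches(hour, when.hour, 0, 23)
        and _field_matches(dom, when.day, 1, 31)
        and _field_matches(month, when.month, 1, 12)
        and _field_matches(dow, cron_dow, 0, 6)
    )
```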
webhook_inbound
Exposes a per-tenant inbound URL. Any external system that POSTs to it (with the secret in the path) fires the workflow. Useful for stitching third-party alerting into OpsMerge.
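From the sending side, firing the workflow is a plain HTTP POST. The sketch below only builds the request. The base URL is a placeholder (copy the real inbound URL from the workflow editor), and the JSON body shape is an assumption, since the expected payload is not specified here.

```python
import json
import urllib.request


def build_inbound_request(base_url: str, secret: str, payload: dict) -> urllib.request.Request:
    """Build a POST to the inbound URL with the secret as the last path segment."""
    url = f"{base_url.rstrip('/')}/{secret}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_inbound_request(
    "https://opsmerge.example/hooks/inbound",  # placeholder, not the real path
    "YOUR_SECRET",
    {"source": "third-party-monitor", "message": "disk latency high"},
)
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```

Because the secret travels in the path, treat the full URL as a credential and keep it out of logs and screenshots.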
Actions
Four kinds today:
run_script
Sends a script to the affected agent via NATS request/reply. The script runs as SYSTEM (Windows) or root (Linux/macOS). Action params:
- script_id: pick from your script library
- args: optional, comma-separated
- timeout_seconds: script execution cap
create_alert
Inserts an alert row in the alerts table — same surface as failing-check alerts. Action params:
- severity: info / warning / error
- message: text, with simple template variables like {agent_name} and {check_name}
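Template variables of this kind behave like named placeholders. A minimal sketch of one way to render them, assuming unknown variables are left in place rather than raising an error (the real renderer's behaviour for unknown names is not documented here):

```python
class _KeepMissing(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def render_template(template: str, context: dict) -> str:
    return template.format_map(_KeepMissing(context))


message = render_template(
    "High memory on {agent_name} ({check_name})",
    {"agent_name": "web-01", "check_name": "memory"},
)
```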
Useful when you want the workflow to surface in Monitoring → Alerts for triage, without coupling to script execution.
fire_webhook
POSTs a JSON payload to a URL you supply. Action params:
- url: must be public (SSRF protection blocks RFC 1918 / loopback / link-local)
- payload_template: optional JSON body; falls back to a default {trigger_id, agent_id, condition_data} shape
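To see which targets the SSRF rule would reject, Python's standard ipaddress module can classify an address. This sketch checks literal IPs only. A real guard also has to resolve the hostname and test every address it resolves to, and OpsMerge's actual implementation may differ. The default_payload helper simply mirrors the fallback shape named above.

```python
import ipaddress


def is_blocked_address(ip: str) -> bool:
    """True for private (incl. RFC 1918), loopback, and link-local addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def default_payload(trigger_id: str, agent_id: str, condition_data: dict) -> dict:
    """Fallback body used when no payload_template is supplied."""
    return {
        "trigger_id": trigger_id,
        "agent_id": agent_id,
        "condition_data": condition_data,
    }
```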
create_ticket
Creates a PSA ticket from the workflow context. Action params:
- title_template: supports {agent_name}, {check_name}, {message}
- priority: low / medium / high / urgent
- assigned_team_id: optional team to land on
- category_id: optional ticket category
Repeated fires of the same workflow for the same condition data write the ticket only once: the dedupe key is (trigger_id, condition_data), enforced by a unique partial index on tickets(org_id, source_type, source_id).
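A stable key of this kind is usually built by canonicalising the condition data and hashing it together with the trigger ID. The sketch below shows one way to do that. The exact derivation OpsMerge uses is not documented here, so treat the function name and the hash choice as assumptions.

```python
import hashlib
import json


def ticket_source_id(trigger_id: str, condition_data: dict) -> str:
    """Derive a deterministic ID so retries of the same fire collide on the unique index."""
    # sort_keys makes the JSON independent of dictionary key order.
    canonical = json.dumps(condition_data, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{trigger_id}:{canonical}".encode("utf-8"))
    return digest.hexdigest()
```

Two calls with the same trigger and the same condition data return the same ID, so a second insert hits the unique index instead of creating a duplicate ticket.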
Worked examples
"Memory full → run cleanup script"
Doug's original ask. New workflow:
- Type: check_failure
- Condition: check_type = memory, min_severity = warning, consecutive_failures = 2
- Action: run_script, script_id = your memory cleanup script
- Cooldown: 3600 (don't re-run within an hour, even if it stays red)
- Enabled: on
The seed workflow ships with create_alert rather than run_script so it works out of the box. Open [Sample] Memory exhaustion, change the action type to run_script, pick your script, save, enable.
"Disk nearly full → run cleanup script"
Same shape, check_type = disk. The seed [Sample] Disk space low is the starting point.
"Agent offline > 10 minutes → create urgent ticket"
- Type: agent_status
- Condition: status = offline, after_minutes = 10
- Action: create_ticket, title_template = Agent {agent_name} offline > 10 minutes, priority = urgent
- Cooldown: 0 (state change won't repeat)
The dedupe key prevents one offline agent from creating two tickets across retries.
"Critical Windows event → ticket and webhook"
- Type: eventlog
- Condition: log_name = System, level = Critical
- Actions: create_ticket (priority high) AND fire_webhook to your incident channel
Multiple actions on one workflow execute in order. If one fails, the rest still run.
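The "keep going on failure" behaviour can be pictured as a loop that catches each action's error and records a per-action result, much like what ends up in the workflow history. This is an illustrative sketch with hypothetical names, not the engine's code.

```python
from typing import Callable


def run_actions(actions: list[tuple[str, Callable[[], object]]]) -> list[dict]:
    """Run actions in order; a failure is recorded and does not stop the chain."""
    results = []
    for name, run in actions:
        try:
            results.append({"action": name, "ok": True, "result": run()})
        except Exception as exc:  # record the failure and continue with the next action
            results.append({"action": name, "ok": False, "error": str(exc)})
    return results
```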
"Daily disk usage report"
- Type: schedule
- Condition: cron = 0 7 * * * (7am every day)
- Action: run_script, script_id = your reporting script
- Scope: pinned to your reporting-server agent so it doesn't fan out to every endpoint
Sample workflows shipped with new tenants
New OpsMerge tenants get three starter examples in the editor, all disabled:
| Sample | Trigger | Action |
|---|---|---|
| [Sample] Disk space low | check_failure / check_type=disk / first failure | create_alert (warning) |
| [Sample] Memory exhaustion | check_failure / check_type=memory / 2 consecutive | create_alert (warning) |
| [Sample] CPU sustained high | check_failure / check_type=cpu / 3 consecutive | create_alert (warning) |
Each is a starting point. Common edits:
- Swap create_alert for run_script and pick a remediation script you've written.
- Lower the cooldown if you want noisier reporting; raise it if a slow-running cleanup script needs more time between fires.
- Pin the scope (agent / site / client) if it should only apply to one agent, site, or client.
Workflow scope
Each workflow can target:
- Global (no scope) — every agent in the tenant
- A specific client — only fires for agents under that client
- A specific site — only fires for agents at that site
- A specific agent — only fires for that one box
Useful for client-specific behaviour without duplicating workflows.
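Scope matching reduces to comparing one ID on the affected agent against the workflow's target. A minimal sketch, assuming a scope is stored as a type plus an ID and that each agent record carries its own agent_id, site_id, and client_id (all hypothetical field names):

```python
from typing import Optional


def in_scope(scope: Optional[dict], agent: dict) -> bool:
    """None means global; otherwise the agent must match the scoped client, site, or agent."""
    if scope is None:
        return True
    kind, target = scope["type"], scope["id"]  # kind is "client", "site", or "agent"
    return agent.get(f"{kind}_id") == target


agent = {"agent_id": "a-17", "site_id": "s-3", "client_id": "c-1"}
```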
Cooldown and de-duplication
- Cooldown: minimum gap between fires of one workflow. The 5-minute default protects against runaway loops; bump to 3600 (seconds) for "once per hour" patterns.
- Workflow history: every fire writes to trigger_history with the condition data and per-action result. Read it via the workflow's history drawer.
- Ticket dedupe: create_ticket derives a stable source_id from (trigger_id, condition_data) so the same monitoring failure doesn't write the same ticket twice across retries.
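The cooldown guard is a time comparison against the workflow's last fire. A short sketch with a hypothetical function name, assuming the cooldown is stored in seconds and that a value of 0 disables the guard:

```python
from datetime import datetime, timedelta
from typing import Optional


def cooldown_allows(last_fired: Optional[datetime], now: datetime,
                    cooldown_seconds: int = 300) -> bool:
    """Allow a fire if the workflow has never fired or the cooldown has elapsed."""
    if last_fired is None:
        return True
    return now - last_fired >= timedelta(seconds=cooldown_seconds)
```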
Versus the PSA workflows engine
There are two engines because they fire on different things:
| | Monitoring workflows | PSA workflows |
|---|---|---|
| Path | /monitoring/workflows | /psa/workflows/ticket-rules |
| Fires on | RMM events (check fail, agent state, event log, schedule, inbound webhook) | PSA ticket events (created, updated, commented, SLA warn, scheduled stale, customer silence) |
| Condition shape | flat JSON per trigger type | tree of AND/OR groups, per-field operators |
| Actions | run_script, create_alert, create_ticket, fire_webhook | thirteen ticket-side actions plus send_email and send_webhook |
| Audit | trigger_history table | ticket_rule_runs table |
A typical end-to-end flow uses both: monitoring workflow detects disk full → create_ticket → PSA workflow ticket.created fires with category = infrastructure → emails the on-call engineer.
Common patterns
"Run different script per check type"
One workflow per check type. Use the check_type condition to filter. Three workflows = three scripts.
"Don't run a script if it already ran recently"
Set the cooldown high enough. Or chain: workflow A creates a ticket; workflow B, a PSA workflow on ticket.created with category = X, runs the script. The PSA rules engine has richer guards (cooldown + max-fires) than the monitoring one.
"Test a workflow without waiting for a real failure"
The workflow list page has a "Test fire" button on each row. It synthesises a fake condition_data and runs the action chain so you can see whether your script execution / webhook delivery / ticket creation actually works.
Common issues
"Workflow fires but script doesn't run." Check trigger_history. If the action result shows NATS not available, the agent was offline at the time of fire: the run_script action is fire-and-forget with no persistence, so the script is not queued for later. Re-fire by re-tripping the condition, or catch the outage earlier with an agent_status workflow.
"Webhook POSTs but receiving system 404s." SSRF protection blocks private addresses (RFC 1918, loopback, link-local). The webhook URL must resolve to a public address.
"create_ticket created the ticket but it's not in my expected queue." No assigned_team_id in the action means the ticket lands in the default queue. Add it to the action params.
"Workflow shows enabled=true but never fires." Three causes in order of likelihood:
- Cooldown is still active from a recent fire.
- The condition filter doesn't match: check_type = "disk" exactly, not "diskspace" or "disk_space". Canonical types: cpu, memory, disk, service, ping, http, script, eventlog.
- The workflow is scoped to a specific agent/site/client and the failing agent doesn't match.
Migration from /triggers
Old route paths still resolve via redirect. Specifically:
- /triggers → /monitoring/workflows
- /triggers/new → /monitoring/workflows/new
- /triggers/:id → /monitoring/workflows/:id
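If you have bookmarks, runbooks, or scripts that embed the old paths, the three redirects collapse into a single prefix swap. A small helper for rewriting them in bulk (the function name is made up for illustration):

```python
def redirect_legacy_path(path: str) -> str:
    """Map an old /triggers path to its /monitoring/workflows equivalent."""
    prefix = "/triggers"
    if path == prefix or path.startswith(prefix + "/"):
        return "/monitoring/workflows" + path[len(prefix):]
    return path  # anything else is left untouched
```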
Existing workflows you authored before the reorg keep working — the underlying triggers table is unchanged. Only the UI route and nav label moved.
Next
- Monitoring & alerts: what gets checked and how alerts flow
- Scripts: the script library that run_script actions invoke
- PSA Workflows: the ticket-side engine that pairs with this one