Monitoring workflows

Monitoring → Workflows is where you wire RMM-side automation: when a check fails, an agent drops off, an event log entry matches a pattern, a cron schedule fires, or an external system POSTs to your tenant — do something. Same shape as PSA Workflows, different event source.

This is the engine that used to be called "Triggers" before the May 2026 IA reorg. Old /triggers URLs redirect into here.

Anatomy of a workflow

A workflow has three parts:

  1. Trigger type — which event source fires it. Five choices.
  2. Condition — a JSON shape that filters which events of that type actually fire the workflow.
  3. Actions — one or more things to do when the workflow fires.

Plus three guards:

  • Scope — leave global, or pin to a specific agent / site / client.
  • Cooldown — minimum gap between fires per workflow. Defaults to 5 minutes.
  • Enabled — toggle without deleting.
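
To make the anatomy concrete, here is a minimal sketch of one workflow written as a Python dictionary. The field names (name, trigger_type, condition, actions, scope, cooldown_seconds, enabled) are assumptions for illustration, not a documented OpsMerge schema. Only the values come from this page.

```python
import json

# Field names are hypothetical; only the values come from this page.
workflow = {
    "name": "Memory full, run cleanup",
    "trigger_type": "check_failure",   # one of the five trigger types
    "condition": {                     # filters which events of that type fire
        "check_type": "memory",
        "min_severity": "warning",
        "consecutive_failures": 2,
    },
    "actions": [                       # one or more things to do on fire
        {
            "type": "create_alert",
            "params": {"severity": "warning", "message": "Memory high on {agent_name}"},
        },
    ],
    "scope": None,                     # None = global; otherwise agent / site / client
    "cooldown_seconds": 300,           # the 5-minute default
    "enabled": False,                  # toggle without deleting
}

print(json.dumps(workflow, indent=2))
```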

Triggers

check_failure

Fires whenever a monitoring check reports fail, error, warn, or warning. The condition can filter by:

  • check_type — cpu, memory, disk, service, ping, http, script, eventlog, or * for all
  • min_severity — error only fires on hard fails; warning includes both
  • consecutive_failures — only fire after N back-to-back failures, so a single transient spike doesn't trigger a script run
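
The three filters can be read as one predicate. The sketch below is an illustration only: it assumes fail and error count as hard fails, that warn and warning are the soft ones, and that the event carries a running consecutive_failures count. The function name and event shape are hypothetical.

```python
HARD_STATUSES = {"fail", "error"}    # assumption: these are the "hard fails"
SOFT_STATUSES = {"warn", "warning"}  # assumption: these are the soft ones

def matches_check_failure(condition, event):
    """Return True if a failing check event passes the condition filter."""
    check_type = condition.get("check_type", "*")
    if check_type != "*" and check_type != event["check_type"]:
        return False

    status = event["status"]
    if condition.get("min_severity", "warning") == "error":
        if status not in HARD_STATUSES:
            return False
    elif status not in HARD_STATUSES | SOFT_STATUSES:
        return False

    # Only fire once the check has failed N times in a row.
    needed = condition.get("consecutive_failures", 1)
    return event.get("consecutive_failures", 1) >= needed
```

With consecutive_failures = 2, the first warn on a memory check is ignored and the second one fires.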

agent_status

Fires when an agent transitions online → offline or vice versa. Condition supports:

  • status — offline, online, or both
  • after_minutes — only fire if the agent has been in that state for at least N minutes (debounces flapping agents on a wobbly link)
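
The after_minutes debounce can be sketched as a comparison between how long the agent has been in its current state and the configured threshold. The function name and arguments below are hypothetical, and the sketch assumes the engine knows when the state began.

```python
from datetime import datetime, timedelta

def matches_agent_status(condition, status, since, now):
    """status: 'online' or 'offline'; since: when the agent entered that state."""
    wanted = condition.get("status", "both")
    if wanted != "both" and wanted != status:
        return False
    # Debounce: the state must have held for at least after_minutes.
    threshold = timedelta(minutes=condition.get("after_minutes", 0))
    return now - since >= threshold
```

An agent that drops for five minutes and comes back never satisfies after_minutes = 10, so a flapping link stays quiet.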

eventlog

Fires when a Windows event log entry matches a pattern. Condition supports:

  • log_name — e.g. System, Application, Security
  • source — the event source (e.g. disk, Service Control Manager)
  • event_id — exact ID or list of IDs
  • level — Critical, Error, Warning, Information
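
As a rough illustration, an eventlog condition can be treated as a set of optional equality filters, with event_id accepting either one ID or a list. The entry shape and function name are assumptions, and the sketch treats any omitted key as "match anything".

```python
def matches_eventlog(condition, entry):
    """Return True if a Windows event log entry passes the condition filter."""
    for key in ("log_name", "source", "level"):
        if key in condition and condition[key] != entry.get(key):
            return False

    wanted = condition.get("event_id")
    if wanted is None:
        return True
    # event_id may be a single ID or a list of IDs.
    ids = wanted if isinstance(wanted, list) else [wanted]
    return entry.get("event_id") in ids
```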

schedule

Cron-style. Fires on a recurring schedule. Condition supports a standard cron expression. Use this for periodic housekeeping that isn't reactive to monitored events.
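
For readers less familiar with cron, the sketch below shows how a five-field expression (minute, hour, day of month, month, day of week) is read against a timestamp. It is a deliberately small subset that handles only * and plain integers, with day of week as 0 to 6 and 0 meaning Sunday. It says nothing about which cron extensions the scheduler itself accepts.

```python
from datetime import datetime

def cron_matches(expr, when):
    """Match a 5-field cron expression that uses only '*' and plain integers."""
    minute, hour, dom, month, dow = expr.split()

    def ok(field, value):
        return field == "*" or int(field) == value

    if not (ok(minute, when.minute) and ok(hour, when.hour) and ok(month, when.month)):
        return False

    dom_ok = ok(dom, when.day)
    dow_ok = ok(dow, when.isoweekday() % 7)  # cron convention: 0 = Sunday
    # Standard cron rule: if both day fields are restricted, either may match.
    if dom != "*" and dow != "*":
        return dom_ok or dow_ok
    return dom_ok and dow_ok
```

For example, 0 7 * * * matches 07:00 on any day and nothing else.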

webhook_inbound

Exposes a per-tenant inbound URL. Any external system that POSTs to it (with the secret in the path) fires the workflow. Useful for stitching third-party alerting into OpsMerge.
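
A sender on the third-party side might look like the sketch below. The base URL and the payload keys are placeholders: this page does not document the exact URL layout beyond "secret in the path", so copy the real URL from the workflow editor. The function only builds the request, and sending it is left to the caller.

```python
import json
import urllib.request

def build_inbound_request(base_url, secret, payload):
    """Build a JSON POST to a hypothetical inbound URL of the form <base_url>/<secret>."""
    url = f"{base_url.rstrip('/')}/{secret}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )

# Placeholder values; pass the result to urllib.request.urlopen() to send it.
request = build_inbound_request(
    "https://example.invalid/hooks", "YOUR-SECRET", {"alert": "disk", "host": "web-01"}
)
```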

Actions

Four kinds today:

run_script

Sends a script to the affected agent via NATS request/reply. The script runs as SYSTEM (Windows) or root (Linux/macOS). Action params:

  • script_id — pick from your script library
  • args — optional, comma-separated
  • timeout_seconds — script execution cap

create_alert

Inserts an alert row in the alerts table — same surface as failing-check alerts. Action params:

  • severity — info / warning / error
  • message — text, with simple template variables like {agent_name} and {check_name}

Useful when you want the workflow to surface in Monitoring → Alerts for triage, without coupling to script execution.
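
The template variables behave like simple named placeholders. The sketch below imitates that with Python's str.format_map and leaves unknown variables untouched. How the real engine treats an unknown variable is not stated on this page, so that behaviour is an assumption.

```python
class _KeepUnknown(dict):
    """Leave {unknown} placeholders as-is instead of raising KeyError."""
    def __missing__(self, key):
        return "{" + key + "}"

def render(template, context):
    """Substitute {name} placeholders from context (illustrative only)."""
    return template.format_map(_KeepUnknown(context))

message = render(
    "Check {check_name} failed on {agent_name}",
    {"agent_name": "web-01", "check_name": "disk C:"},
)
print(message)  # Check disk C: failed on web-01
```

This approach would not suit templates that contain literal braces, such as a JSON body.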

fire_webhook

POSTs a JSON payload to a URL you supply. Action params:

  • url — must be public (SSRF protection blocks RFC 1918 / loopback / link-local)
  • payload_template — optional JSON body; falls back to a default {trigger_id, agent_id, condition_data} shape
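
The two rules above can be sketched as follows. The address check uses Python's ipaddress module against the ranges named in the bullet (RFC 1918, loopback, link-local). The function names are hypothetical, and a real check would also need to resolve the hostname and test every address it resolves to, which this sketch does not do.

```python
import ipaddress
import json

_BLOCKED_NETWORKS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "127.0.0.0/8", "::1/128",                          # loopback
    "169.254.0.0/16", "fe80::/10",                     # link-local
)]

def is_blocked_address(ip_text):
    """Return True if an IP literal falls inside a blocked range."""
    ip = ipaddress.ip_address(ip_text)
    return any(ip in net for net in _BLOCKED_NETWORKS)

def build_body(payload_template, trigger_id, agent_id, condition_data):
    """Use the supplied template if there is one, else the default shape."""
    if payload_template is not None:
        return payload_template
    return json.dumps({
        "trigger_id": trigger_id,
        "agent_id": agent_id,
        "condition_data": condition_data,
    })
```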

create_ticket

Creates a PSA ticket from the workflow context. Action params:

  • title_template — supports {agent_name}, {check_name}, {message}
  • priority — low / medium / high / urgent
  • assigned_team_id — optional team to land on
  • category_id — optional ticket category

Two fires of the same workflow with the same condition data write the ticket only once. The dedupe key is (trigger_id, condition_data), enforced by a unique partial index on tickets(org_id, source_type, source_id).
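
One way to derive a stable source_id from (trigger_id, condition_data) is to hash a canonical JSON form of the condition data. This is a sketch of the idea, not the actual derivation OpsMerge uses; the function name and the choice of SHA-256 are assumptions.

```python
import hashlib
import json

def ticket_source_id(trigger_id, condition_data):
    """Derive a deterministic ID so the same fire maps to the same ticket."""
    # sort_keys makes the JSON independent of dictionary key order.
    canonical = json.dumps(condition_data, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{trigger_id}:{canonical}".encode("utf-8"))
    return digest.hexdigest()
```

A retry with identical inputs yields the same source_id, so a unique index on that column rejects the second insert.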

Worked examples

"Memory full → run cleanup script"

Doug's original ask. New workflow:

  • Type: check_failure
  • Condition: check_type = memory, min_severity = warning, consecutive_failures = 2
  • Action: run_script, script_id = your memory cleanup script
  • Cooldown: 3600 (don't re-run within an hour, even if it stays red)
  • Enabled: on

The seed workflow ships with create_alert rather than run_script so it works out of the box. Open [Sample] Memory exhaustion, change the action type to run_script, pick your script, save, enable.

"Disk nearly full → run cleanup script"

Same shape, check_type = disk. The seed [Sample] Disk space low is the starting point.

"Agent offline > 10 minutes → create urgent ticket"

  • Type: agent_status
  • Condition: status = offline, after_minutes = 10
  • Action: create_ticket, title_template = Agent {agent_name} offline > 10 minutes, priority = urgent
  • Cooldown: 0 (state change won't repeat)

The dedupe key prevents one offline agent from creating two tickets across retries.

"Critical Windows event → ticket and webhook"

  • Type: eventlog
  • Condition: log_name = System, level = Critical
  • Actions: create_ticket (priority high) AND fire_webhook to your incident channel

Multiple actions on one workflow execute in order. If one fails the rest still run.
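
The "in order, and one failure does not stop the rest" behaviour can be sketched as a loop that records a result per action. The action shape (a type plus a callable under "run") and the result fields are hypothetical.

```python
def run_actions(actions, context):
    """Run each action in order; record failures without stopping the chain."""
    results = []
    for action in actions:
        try:
            output = action["run"](context)
            results.append({"type": action["type"], "ok": True, "output": output})
        except Exception as exc:  # a failed action must not block later ones
            results.append({"type": action["type"], "ok": False, "error": str(exc)})
    return results
```

If create_ticket raises, the fire_webhook action that follows still runs, and both outcomes end up in the per-action results.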

"Daily disk usage report"

  • Type: schedule
  • Condition: cron = 0 7 * * * (7am every day)
  • Action: run_script, script_id = your reporting script
  • Scope: pinned to your reporting-server agent so it doesn't fan out to every endpoint

Sample workflows shipped with new tenants

New OpsMerge tenants get three starter examples in the editor, all disabled:

Sample | Trigger | Action
[Sample] Disk space low | check_failure / check_type=disk / first failure | create_alert (warning)
[Sample] Memory exhaustion | check_failure / check_type=memory / 2 consecutive | create_alert (warning)
[Sample] CPU sustained high | check_failure / check_type=cpu / 3 consecutive | create_alert (warning)

Each is a starting point. Common edits:

  • Swap create_alert for run_script and pick a remediation script you've written.
  • Lower the cooldown if you want noisier reporting; raise it if a slow-running cleanup script needs more time between fires.
  • Pin the scope (agent / site / client) if it should only apply to one client.

Workflow scope

Each workflow can target:

  • Global (no scope) — every agent in the tenant
  • A specific client — only fires for agents under that client
  • A specific site — only fires for agents at that site
  • A specific agent — only fires for that one box

Useful for client-specific behaviour without duplicating workflows.
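
Scope matching can be pictured as a check of the failing agent against whichever ID the workflow is pinned to. The key names (agent_id, site_id, client_id) are assumptions for the sketch.

```python
def in_scope(scope, agent):
    """scope: None/empty for global, or a dict such as {"client_id": "c1"}."""
    if not scope:
        return True  # global: every agent in the tenant
    return all(agent.get(key) == value for key, value in scope.items())

agent = {"agent_id": "a1", "site_id": "s1", "client_id": "c1"}
print(in_scope(None, agent))                # True
print(in_scope({"client_id": "c1"}, agent)) # True
print(in_scope({"site_id": "s9"}, agent))   # False
```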

Cooldown and de-duplication

  • Cooldown — minimum gap between fires of one workflow. A 5-minute default protects against runaway loops; bump to 3600 for "once per hour" patterns.
  • Workflow history — every fire writes to trigger_history with the condition data and per-action result. Read it via the workflow's history drawer.
  • Ticket dedupe — create_ticket derives a stable source_id from (trigger_id, condition_data) so the same monitoring failure doesn't write the same ticket twice across retries.
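
The cooldown guard amounts to remembering the last fire time per workflow and refusing to fire again until the gap has elapsed. The class below is a hypothetical in-memory sketch; times are plain seconds so the behaviour is easy to follow.

```python
class CooldownGuard:
    """Track the last fire time per workflow and enforce a minimum gap."""

    def __init__(self):
        self._last_fired = {}

    def try_fire(self, workflow_id, cooldown_seconds, now):
        last = self._last_fired.get(workflow_id)
        if last is not None and now - last < cooldown_seconds:
            return False  # still cooling down
        self._last_fired[workflow_id] = now
        return True

guard = CooldownGuard()
print(guard.try_fire("wf-1", 300, now=0))    # True: first fire
print(guard.try_fire("wf-1", 300, now=120))  # False: inside the 5-minute window
print(guard.try_fire("wf-1", 300, now=300))  # True: window has elapsed
```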

Versus the PSA workflows engine

There are two engines because they fire on different things:

 | Monitoring workflows | PSA workflows
Path | /monitoring/workflows | /psa/workflows/ticket-rules
Fires on | RMM events (check fail, agent state, event log, schedule, inbound webhook) | PSA ticket events (created, updated, commented, SLA warn, scheduled stale, customer silence)
Condition shape | flat JSON per trigger type | tree of AND/OR groups, per-field operators
Actions | run_script, create_alert, create_ticket, fire_webhook | thirteen ticket-side actions plus send_email and send_webhook
Audit | trigger_history table | ticket_rule_runs table

A typical end-to-end flow uses both: monitoring workflow detects disk full → create_ticket → PSA workflow ticket.created fires with category = infrastructure → emails the on-call engineer.

Common patterns

"Run different script per check type"

One workflow per check type. Use the check_type condition to filter. Three workflows = three scripts.

"Don't run a script if it already ran recently"

Set the cooldown high enough. Or chain: workflow A creates a ticket; PSA workflow B, firing on ticket.created with category = X, runs the script. The PSA rules engine has richer guards (cooldown + max-fires) than the monitoring one.

"Test a workflow without waiting for a real failure"

The workflow list page has a "Test fire" button on each row. It synthesises a fake condition_data and runs the action chain so you can see whether your script execution / webhook delivery / ticket creation actually works.

Common issues

"Workflow fires but script doesn't run." Check trigger_history. If the action result shows NATS not available, the agent was offline at the time of fire: the run_script action is fire-and-forget, with no persistence. Re-fire by re-tripping the condition, or catch the outage earlier with an agent_status workflow.

"Webhook POSTs but receiving system 404s." SSRF protection blocks private addresses. The webhook must be publicly resolvable.

"create_ticket created the ticket but it's not in my expected queue." No assigned_team_id in the action means the ticket lands in the default queue. Add it to the action params.

"Workflow shows enabled=true but never fires." Three causes in order of likelihood:

  1. Cooldown is still active from a recent fire.
  2. The condition filter doesn't match — check_type = "disk" exactly, not "diskspace" or "disk_space". Canonical types: cpu, memory, disk, service, ping, http, script, eventlog.
  3. The workflow is scoped to a specific agent/site/client and the failing agent doesn't match.

Migration from /triggers

Old route paths still resolve via redirect. Specifically:

  • /triggers → /monitoring/workflows
  • /triggers/new → /monitoring/workflows/new
  • /triggers/:id → /monitoring/workflows/:id

Existing workflows you authored before the reorg keep working — the underlying triggers table is unchanged. Only the UI route and nav label moved.
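
If you keep old links in runbooks or scripts, the mapping is mechanical enough to rewrite in bulk. The helper below is a hypothetical sketch that swaps the /triggers prefix for /monitoring/workflows and leaves any other path alone.

```python
import re

def migrate_path(path):
    """Rewrite an old /triggers path to its /monitoring/workflows equivalent."""
    match = re.fullmatch(r"/triggers(/.*)?", path)
    if not match:
        return path  # not an old triggers route
    return "/monitoring/workflows" + (match.group(1) or "")

print(migrate_path("/triggers"))      # /monitoring/workflows
print(migrate_path("/triggers/new"))  # /monitoring/workflows/new
print(migrate_path("/triggers/42"))   # /monitoring/workflows/42
```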

OpsMerge is a product of Brindleford Technologies Ltd, company number 16871436, registered in England and Wales.