Monitoring & alerts
OpsMerge watches every monitored endpoint continuously and turns failing checks into alerts, which can become tickets. This article covers what's checked, how to add custom checks, and how alerts flow into your PSA.
Out-of-the-box checks
Every agent runs a baseline set of checks without configuration:
| Check | What it does | Default threshold |
|---|---|---|
| CPU load | 1/5/15-minute averages | Alert > 90% sustained for 5 min |
| Memory | Available memory | Alert < 10% available |
| Disk space | Per-volume free space | Alert < 10% free, OR < 5GB if volume > 100GB |
| Disk I/O | Read/write latency on each disk | Alert > 100ms average for 5 min |
| Network | Per-interface up/down state | Alert on transition |
| Service status (Windows) | Core OS + agent services | Alert if Stopped when set to Auto |
| systemd unit status (Linux) | Agent + standard system units | Alert if failed or inactive when expected active |
| Heartbeat | Agent connectivity to OpsMerge | Alert if no heartbeat for > 5 min |
| Updates available (Windows) | Pending Windows Updates | Information-only (no alert), unless > 30 days outstanding |
| Updates available (Linux) | apt/dnf available updates | Information-only |
| Pending reboot | Reboot required after update | Information-only |
These defaults are tuned for typical office endpoints. For servers and high-criticality machines, you'll want tighter thresholds — see Policies for how to override.
Adding custom checks
Settings → RMM → Checks → + New check.
Three flavours of custom check:
1. Threshold check on a metric the agent already collects
E.g. "alert if a specific Windows service is stopped" or "alert if CPU temperature > 85°C".
Pick the metric, set the comparison, set the threshold, set the duration. Done.
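To show how the four settings fit together, here is a minimal Python sketch of a sustained-threshold evaluation. It is an illustration only, not OpsMerge's implementation: the `ThresholdCheck` class, the `is_failing` function, and the sample format are all hypothetical, and it assumes "duration" means the breach must persist without interruption for that long.

```python
import operator
from dataclasses import dataclass

# Hypothetical mapping from the comparison setting to a Python operator.
COMPARISONS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

@dataclass
class ThresholdCheck:
    metric: str        # e.g. "cpu_temp_c"
    comparison: str    # one of the keys in COMPARISONS
    threshold: float
    duration_s: int    # how long the breach must last before the check fails

def is_failing(check, samples):
    """samples is a list of (timestamp_seconds, value), oldest first.

    The check fails only if the most recent unbroken run of breaching
    samples has lasted at least check.duration_s.
    """
    breaches = COMPARISONS[check.comparison]
    breach_start = None
    for timestamp, value in samples:
        if breaches(value, check.threshold):
            if breach_start is None:
                breach_start = timestamp   # a new run of breaches begins
        else:
            breach_start = None            # any healthy sample resets the run
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= check.duration_s
```

With a 300-second duration, a single healthy sample in the middle of the window resets the clock, which is what keeps short spikes from alerting.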
2. Script-based check
The agent runs a script on a schedule, and the script's exit code determines pass/fail:
- Exit 0 = check passed.
- Exit non-zero = check failed; the script's stdout is the alert message.
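The exit-code contract above can be sketched from the runner's side. This is a hypothetical Python model, not the agent's actual code: `run_script_check` is an invented name, and treating a timeout as a failure is an assumption the article does not confirm.

```python
import subprocess
import sys

def run_script_check(command, timeout_s=60):
    """Run a check script and return (passed, message).

    Exit code 0 means the check passed. Any other exit code means it
    failed, and the script's stdout becomes the alert message.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Assumption: a script that never finishes counts as a failed check.
        return False, f"check timed out after {timeout_s}s"
    passed = result.returncode == 0
    message = "" if passed else result.stdout.strip()
    return passed, message

if __name__ == "__main__":
    # A stand-in script that prints a message and exits non-zero.
    failing = [sys.executable, "-c", "import sys; print('disk full'); sys.exit(2)"]
    print(run_script_check(failing))
```

The practical consequence for script authors: write the human-readable explanation to stdout before exiting non-zero, because that text is what appears in the alert.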
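Example: a PowerShell script that fails the check once an application's error log grows past 100 lines. It counts every line in the file on each run; it does not track which entries are new since the last run.
```powershell
# Count the lines currently in the application's error log.
$ErrorCount = (Get-Content C:\App\error.log | Measure-Object -Line).Lines
if ($ErrorCount -gt 100) {
    # stdout becomes the alert message; a non-zero exit code fails the check.
    Write-Output "Error log has $ErrorCount entries; investigate."
    exit 1
}
exit 0
```
3. External check (no agent involvement)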
OpsMerge itself reaches out and tests something. This is useful for domain monitoring, but also for questions like "is this server's external HTTPS endpoint responding?". See Domain monitoring for the external-check infrastructure.
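As a rough picture of what an "is the endpoint responding?" probe does, here is a minimal Python sketch using only the standard library. It is not how OpsMerge performs external checks: the function name is hypothetical, and counting any 2xx or 3xx response as healthy is an assumption.

```python
import urllib.error
import urllib.request

def https_endpoint_responding(url, timeout_s=10.0):
    """Return True if the URL answers with a 2xx or 3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, TLS errors, timeouts,
        # and HTTP error statuses (urlopen raises HTTPError for 4xx/5xx).
        return False
```

Because the probe runs from outside the network, it also catches failures an agent cannot see, such as an expired certificate or a firewall rule blocking inbound traffic.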
Alert rules
A check is a thing that can pass or fail. An alert rule decides what to do when a check fails.
Settings → RMM → Alert rules → + New rule.
You set:
- Which check(s) trigger the rule.
- Scope: all endpoints, specific client, specific tag, etc.
- Severity: Info / Warning / Critical.
- Action: notify, create a ticket, run a script, or any combination of these.
Actions:
| Action | Behaviour |
|---|---|
| Notify | Sends the alert to a team member's email/dashboard; no ticket is created. Useful for "I want to know but it's not urgent". |
| Create ticket | An OpsMerge ticket is created. Priority and assignment per rule. |
| Run script | Run a script against the affected endpoint (with safety guards). |
| Acknowledge automatically | Auto-clear when the check goes green again (default behaviour anyway). |
Multiple actions can fire from one rule. The most common pattern is "Notify + Create ticket".
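To make the relationship between check, scope, and action concrete, here is a small Python model of a rule. Every name in it (`Endpoint`, `AlertRule`, `matches`, the action strings) is hypothetical; OpsMerge configures rules through the Settings UI, and this sketch only mirrors the fields listed above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Endpoint:
    name: str
    client: str
    tags: set = field(default_factory=set)

@dataclass
class AlertRule:
    checks: set                    # names of the checks that trigger the rule
    severity: str                  # "Info", "Warning" or "Critical"
    actions: list                  # e.g. ["notify", "create_ticket"]
    client: Optional[str] = None   # None means "any client"
    tag: Optional[str] = None      # None means "any tag"

    def matches(self, check_name, endpoint):
        """True if a failure of check_name on endpoint falls in this rule's scope."""
        if check_name not in self.checks:
            return False
        if self.client is not None and endpoint.client != self.client:
            return False
        if self.tag is not None and self.tag not in endpoint.tags:
            return False
        return True

if __name__ == "__main__":
    rule = AlertRule(
        checks={"Heartbeat"},
        severity="Critical",
        actions=["notify", "create_ticket"],
        tag="criticality:prod",
    )
    server = Endpoint("db01", "Acme", {"criticality:prod"})
    print(rule.matches("Heartbeat", server))
```

A rule with both `client` and `tag` left empty matches every endpoint, which is the usual cause of alerts firing somewhere unexpected.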
Alert lifecycle
```
Check fails ──> Alert raised ──> (optional) Ticket created
                                       │
                                       │ while alert is active
                                       │
Check passes again ──> Alert cleared ──┴──> Ticket linked to alert auto-updates
```
- An alert is the live state. While the check is failing, the alert is "active".
- A ticket is the PSA-side record. Tickets created from alerts are linked back to the alert.
- When the check goes green, the alert auto-clears. The linked ticket may auto-close (configurable) or just get an update comment ("alert resolved").
If a check fails, passes, then fails again, the original alert is reused, so you don't get a swarm of duplicate alerts for a flapping condition. The alert just goes active/cleared/active.
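The reuse behaviour can be modelled as one alert per check-and-endpoint pair that toggles between states. This Python sketch is illustrative only: `Alert`, `AlertTracker`, and the `history` list are hypothetical names, and keying alerts by (check, endpoint) is an assumption about how "the original alert" is identified.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    check: str
    endpoint: str
    active: bool = True
    history: list = field(default_factory=lambda: ["active"])

class AlertTracker:
    def __init__(self):
        self._alerts = {}  # (check, endpoint) -> Alert

    def record(self, check, endpoint, passed):
        """Feed in one check result; return the alert for that pair, if any."""
        key = (check, endpoint)
        alert = self._alerts.get(key)
        if passed:
            if alert is not None and alert.active:
                alert.active = False          # check went green: clear the alert
                alert.history.append("cleared")
            return alert
        if alert is None:
            alert = Alert(check, endpoint)    # first failure: raise a new alert
            self._alerts[key] = alert
        elif not alert.active:
            alert.active = True               # repeat failure: reuse the old alert
            alert.history.append("active")
        return alert
```

A flapping check therefore produces one alert whose history grows, rather than a new alert per failure.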
Mute, snooze, suppress
| Action | Effect | When to use |
|---|---|---|
| Mute check on endpoint | Specific check stops firing for this endpoint. | The endpoint is intentionally in a weird state and shouldn't alert. |
| Snooze alert | Active alert is hidden for X hours/days. | You're aware, you'll deal with it, stop bothering you. |
| Suppress during window | Alerts during specific times (e.g. maintenance windows) are auto-acknowledged. | Scheduled work where alerts would be noise. |
Mutes and suppressions are visible in the Suppressed alerts view so they don't get silently forgotten.
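For "Suppress during window", the core question is whether a given moment falls inside the window. A minimal Python sketch, assuming a daily window defined by a start and end time of day (the function name and that assumption are not from OpsMerge):

```python
from datetime import datetime, time

def in_suppression_window(now, start, end):
    """True if now's time of day falls in [start, end).

    Handles windows that cross midnight, such as 22:00 to 02:00.
    """
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight: match late evening or early morning.
    return current >= start or current < end

if __name__ == "__main__":
    overnight = (time(22, 0), time(2, 0))
    print(in_suppression_window(datetime(2024, 1, 1, 23, 30), *overnight))
```

Overnight maintenance windows are the common case, so the wrap-around branch matters more than it looks.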
VIP escalation
If a check fails on an endpoint owned by a VIP contact (see Clients & contacts), the alert's severity is automatically bumped one level — Info becomes Warning, Warning becomes Critical. This is on by default and configurable per tenant.
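The one-level bump is easy to state as code. In this Python sketch the function name and the `enabled` flag are hypothetical, and leaving Critical unchanged is an assumption, since the article does not say what happens at the top level.

```python
SEVERITIES = ["Info", "Warning", "Critical"]  # lowest to highest

def escalate_for_vip(severity, owner_is_vip, enabled=True):
    """Bump severity one level for VIP-owned endpoints.

    Assumption: Critical has nowhere higher to go, so it stays Critical.
    """
    if not (enabled and owner_is_vip):
        return severity
    index = SEVERITIES.index(severity)
    return SEVERITIES[min(index + 1, len(SEVERITIES) - 1)]
```

The `enabled` flag stands in for the tenant setting: with escalation turned off, every alert keeps the severity its rule assigned.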
Common patterns
"I want to be paged for prod server outages but not for office laptop flakiness"
- Tag your prod server endpoints with criticality:prod.
- Create an alert rule: scope = criticality:prod, severity = Critical, action = Notify on-call rotation.
- Existing alerts on non-prod endpoints continue to behave normally.
"Disk space alerts are noisy for our terminal-server clients (always full, always being managed)"
- Identify the affected endpoints.
- Either mute the disk-space check on those endpoints, or change the policy threshold for that client to a more permissive value.
"I need to monitor a metric the agent doesn't currently collect"
Two options:
- Write a script-based check (the agent runs it, the check uses the script's exit code).
- Write a tray-app metric publisher (advanced — talk to support for guidance).
Common issues
Alert keeps firing for something I "fixed". The underlying check is still failing. Alerts auto-clear when the check goes green; if it's still active, the condition is still present. Look at the check's recent values on the agent's page.
A check is firing on the wrong client's endpoints. Almost always a scoping mistake on the alert rule. Open the rule, check its scope (client/tag filter) — it's likely too broad.
I muted a check but it's still alerting. Mute applies to that specific check + endpoint combination. If you muted "Disk space" on endpoint X, but a separate rule also alerts on "Disk space" with a custom check, that won't be muted. Mute scope is per-check-and-endpoint.
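One way to picture per-check-and-endpoint scope is a set of pairs. This Python sketch is a mental model, not OpsMerge's data model; the check identifiers `disk-space` and `custom-disk-space` are made up for illustration.

```python
muted = set()  # holds (check_id, endpoint) pairs

def mute(check_id, endpoint):
    muted.add((check_id, endpoint))

def is_muted(check_id, endpoint):
    # Only an exact pair match counts; similar-looking checks are separate.
    return (check_id, endpoint) in muted

if __name__ == "__main__":
    mute("disk-space", "endpoint-x")
    print(is_muted("disk-space", "endpoint-x"))         # the built-in check
    print(is_muted("custom-disk-space", "endpoint-x"))  # a separate custom check
```

The custom check needs its own mute, even though both checks watch the same disk on the same endpoint.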
Next
- Alert templates — the multi-channel notification config (email + Slack + Teams + Discord + webhook + SMS) and the agent/site/client cascade that decides which one fires
- Workflows — automate the response side: when a check fails, run a script, create a ticket, or fire a webhook
- Scripts & automation — the script library workflows invoke
- Tickets — what happens to alerts that become tickets