Monitoring & alerts

OpsMerge watches every monitored endpoint continuously and turns failing checks into alerts, which can become tickets. This article covers what's checked, how to add custom checks, and how alerts flow into your PSA.

Out-of-the-box checks

Every agent runs a baseline set of checks without configuration:

| Check | What it does | Default threshold |
| --- | --- | --- |
| CPU load | 1/5/15-minute averages | Alert > 90% sustained for 5 min |
| Memory | Available memory | Alert < 10% available |
| Disk space | Per-volume free space | Alert < 10% free, OR < 5 GB if volume > 100 GB |
| Disk I/O | Read/write latency on each disk | Alert > 100 ms average for 5 min |
| Network | Per-interface up/down state | Alert on transition |
| Service status (Windows) | Core OS + agent services | Alert if `Stopped` when set to `Auto` |
| systemd unit status (Linux) | Agent + standard system units | Alert if `failed` or `inactive` when expected `active` |
| Heartbeat | Agent connectivity to OpsMerge | Alert if no heartbeat for > 5 min |
| Updates available (Windows) | Pending Windows Updates | Information-only (no alert), unless > 30 days outstanding |
| Updates available (Linux) | `apt`/`dnf` available updates | Information-only |
| Pending reboot | Reboot required after update | Information-only |

These defaults are tuned for typical office endpoints. For servers and high-criticality machines, you'll want tighter thresholds — see Policies for how to override.

Adding custom checks

Settings → RMM → Checks → + New check.

Three flavours of custom check:

1. Threshold check on a metric the agent already collects

E.g. "alert if a specific Windows service is stopped" or "alert if CPU temperature > 85°C".

Pick the metric, set the comparison, set the threshold, set the duration. Done.
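To make the four settings concrete, here is a minimal sketch of how a "sustained for a duration" threshold could be evaluated. It is an illustration only, not OpsMerge's implementation: the `ThresholdCheck` class, the comparison labels, and the `(timestamp, value)` sample format are all assumptions.

```python
import operator
from dataclasses import dataclass

# Comparison labels are illustrative, not OpsMerge's actual option names.
COMPARISONS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass
class ThresholdCheck:
    comparison: str    # one of the keys in COMPARISONS
    threshold: float   # value the metric is compared against
    duration_s: int    # how long the breach must last before the check fails


def is_failing(check: ThresholdCheck, samples: list[tuple[int, float]], now: int) -> bool:
    """Return True if every sample in the last `duration_s` seconds breaches the threshold.

    `samples` is a list of (unix_timestamp, value) pairs, oldest first.
    """
    breach = COMPARISONS[check.comparison]
    window_start = now - check.duration_s
    # The check cannot fail until the samples cover the whole window.
    if not samples or samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(breach(value, check.threshold) for value in recent)
```

With this model, a single reading back under the threshold inside the window keeps the check passing, which is what stops short spikes from raising alerts.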

2. Script-based check

The agent runs a script on a schedule, and the script's exit code determines pass/fail:

Exit 0 = check passed.
Exit non-zero = check failed; the script's stdout is the alert message.

Example: a PowerShell script that checks whether a specific log file has new error entries since the last run.

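```powershell
$LogPath   = 'C:\App\error.log'
$StatePath = 'C:\App\error.log.count'   # assumed state file holding the previous line count
$Current   = @(Get-Content -Path $LogPath).Count
$Previous  = if (Test-Path $StatePath) { [int](Get-Content -Path $StatePath) } else { 0 }
Set-Content -Path $StatePath -Value $Current
# A lower count than last time means the log was rotated, so count every line as new.
$NewCount  = if ($Current -ge $Previous) { $Current - $Previous } else { $Current }
if ($NewCount -gt 0) {
    Write-Output "Error log has $NewCount new entries since the last run. Investigate."
    exit 1
}
exit 0
```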
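The exit-code contract can be modelled from the other side as well. The sketch below shows how a runner might turn a script's exit code and stdout into a pass/fail result. It is not the agent's actual code: `run_script_check`, `CheckResult`, and the timeout handling are hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    message: str  # alert text; empty when the check passed


def run_script_check(command: list[str], timeout_s: float = 60.0) -> CheckResult:
    """Run a check script and map its exit code to pass/fail."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Assumption: a script that hangs is treated as a failed check.
        return CheckResult(False, f"check timed out after {timeout_s:g}s")
    if proc.returncode == 0:
        return CheckResult(True, "")
    # Non-zero exit: the script's stdout becomes the alert message.
    return CheckResult(False, proc.stdout.strip())


# Stand-in for a real check script: prints a message and exits with code 1.
DEMO_COMMAND = [sys.executable, "-c", "print('disk almost full'); raise SystemExit(1)"]
```

Calling `run_script_check(DEMO_COMMAND)` gives a failed result whose message is the text the script printed, so it pays to keep a check script's stdout short and actionable.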

3. External check (no agent involvement)

OpsMerge itself reaches out and tests something. This is useful for domain monitoring, and also for questions such as "is this server's external HTTPS endpoint responding?". See Domain monitoring for the external check infrastructure.
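As a rough illustration of what an "is the HTTPS endpoint responding?" probe involves, the sketch below makes one GET request with Python's standard library. It is not how OpsMerge runs external checks: the function name, the 10-second timeout, and the injectable `opener` parameter (there so the probe can be exercised without network access) are assumptions.

```python
import urllib.error
import urllib.request
from typing import Callable


def https_endpoint_ok(url: str, timeout_s: float = 10.0,
                      opener: Callable = urllib.request.urlopen) -> tuple[bool, str]:
    """Return (passed, message) for a single GET against `url`."""
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen reports 4xx/5xx answers as exceptions.
        return False, f"{url} answered HTTP {exc.code}"
    except OSError as exc:
        # Covers DNS failures, refused connections, TLS errors and timeouts.
        return False, f"{url} unreachable: {exc}"
    return True, f"{url} answered HTTP {status}"
```

The probe separates "the server answered with an error" from "nothing answered at all", which usually point to different owners for the resulting ticket.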

Alert rules

A check is a thing that can pass or fail. An alert rule decides what to do when a check fails.

Settings → RMM → Alert rules → + New rule.

You set:

Which check(s) trigger the rule.
Scope: all endpoints, specific client, specific tag, etc.
Severity: Info / Warning / Critical.
Action: notify, create a ticket, run a script, or any combination of these.

Actions:

| Action | Behaviour |
| --- | --- |
| Notify | Sends the alert to a team member's email or dashboard without creating a ticket. Useful for "I want to know but it's not urgent". |
| Create ticket | An OpsMerge ticket is created, one per alert incident. Priority and assignment are set per rule. |
| Run script | Runs a script against the affected endpoint (with safety guards). |
| Acknowledge automatically | Auto-clears the alert when the check goes green again (this is the default behaviour anyway). |

Multiple actions can fire from one rule. The most common pattern is "Notify + Create ticket".
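The relationship between checks, scope, severity, and actions can be sketched as a small data model. The class and field names below are hypothetical and the scope is simplified to one optional client plus one optional tag. The real rule editor offers more.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    name: str
    client: str
    tags: set[str] = field(default_factory=set)


@dataclass
class AlertRule:
    checks: set[str]              # check names that trigger the rule
    severity: str                 # "Info", "Warning" or "Critical"
    actions: list[str]            # e.g. ["notify", "create_ticket"]
    client: Optional[str] = None  # None means any client
    tag: Optional[str] = None     # None means any tag

    def matches(self, check: str, endpoint: Endpoint) -> bool:
        if check not in self.checks:
            return False
        if self.client is not None and endpoint.client != self.client:
            return False
        if self.tag is not None and self.tag not in endpoint.tags:
            return False
        return True


def actions_for(rules: list[AlertRule], check: str, endpoint: Endpoint) -> list[tuple[str, str]]:
    """Return (severity, action) pairs for every rule that matches a failed check."""
    return [(rule.severity, action)
            for rule in rules if rule.matches(check, endpoint)
            for action in rule.actions]
```

A rule with both `client` and `tag` left empty matches every endpoint, which is the usual cause of alerts appearing for the wrong client.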

Alert lifecycle

Check fails ──> Alert raised ──> (optional) Ticket created
                                       │
                                       │  while alert is active
                                       │
Check passes again ──> Alert cleared ──┴──> Ticket linked to alert auto-updates

An alert is the live state. While the check is failing, the alert is "active".
A ticket is the PSA-side record. Tickets created from alerts are linked back to the alert, and the alert stays linked to one ticket for the life of the incident, no matter how often the workflow re-fires.
When the check goes green, the alert auto-clears and the linked ticket is auto-resolved with an internal note. A brief flap that re-fires the alert reopens the same ticket; a ticket you closed by hand stays closed and a genuinely new incident raises a fresh one.

If a check fails, passes, then fails again, the original alert is reused, so you don't get a swarm of duplicate alerts for a flapping condition. The alert just goes active, cleared, active.
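The lifecycle above can be read as a small state machine. The sketch below is one possible model, not OpsMerge's code. It assumes every alert has a rule that creates a ticket, that tickets have three statuses ("open", "resolved" for automatic resolution, "closed" for a manual close), and that the first failure after a manual close counts as a new incident.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    id: int
    status: str = "open"  # "open", "resolved" (automatic) or "closed" (by hand)
    notes: list[str] = field(default_factory=list)


@dataclass
class Alert:
    check: str
    endpoint: str
    active: bool = False
    ticket: Optional[Ticket] = None


class AlertTracker:
    """Keeps one alert per (check, endpoint) pair and reuses it when the check flaps."""

    def __init__(self) -> None:
        self.alerts: dict[tuple[str, str], Alert] = {}
        self._next_ticket_id = 1

    def _new_ticket(self) -> Ticket:
        ticket = Ticket(self._next_ticket_id)
        self._next_ticket_id += 1
        return ticket

    def record(self, check: str, endpoint: str, passed: bool) -> Optional[Alert]:
        key = (check, endpoint)
        alert = self.alerts.get(key)
        if passed:
            if alert is not None and alert.active:
                alert.active = False
                if alert.ticket is not None and alert.ticket.status == "open":
                    alert.ticket.status = "resolved"
                    alert.ticket.notes.append("Check passed again; resolved automatically.")
            return alert
        if alert is None:
            alert = self.alerts[key] = Alert(check, endpoint)
        if not alert.active:
            alert.active = True
            if alert.ticket is None or alert.ticket.status == "closed":
                # Assumption: a hand-closed ticket is never reopened.
                alert.ticket = self._new_ticket()
            else:
                alert.ticket.status = "open"  # reopen the auto-resolved ticket
        return alert
```

Repeated failures while the alert is already active change nothing, which is why a check that stays red produces one alert and one ticket rather than one per run.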

Mute, snooze, suppress

| Action | Effect | When to use |
| --- | --- | --- |
| Mute check on endpoint | The specific check stops firing for this endpoint. | The endpoint is intentionally in an unusual state and shouldn't alert. |
| Snooze alert | The active alert is hidden for X hours/days. | You're aware of the problem, will deal with it, and want the reminders to stop. |
| Suppress during window | Alerts during specific times (e.g. maintenance windows) are auto-acknowledged. | Scheduled work where alerts would be noise. |

Mutes and suppressions are visible in the Suppressed alerts view so they don't get silently forgotten.
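The three mechanisms differ mainly in what they key on: a mute is tied to a check and an endpoint, a snooze to one alert and an expiry time, and a suppression to a time window. A minimal sketch of that distinction follows. The `Silencing` class and its fields are hypothetical, and the precedence order (mute, then snooze, then window) is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Silencing:
    # Mutes are stored per (check, endpoint) pair.
    muted: set[tuple[str, str]] = field(default_factory=set)
    # Snoozes expire at the stored time.
    snoozed_until: dict[tuple[str, str], datetime] = field(default_factory=dict)
    # Suppression windows as (start, end) pairs, applied to every check.
    windows: list[tuple[datetime, datetime]] = field(default_factory=list)

    def reason(self, check: str, endpoint: str, now: datetime) -> Optional[str]:
        """Return why an alert is silenced right now, or None if it should be shown."""
        key = (check, endpoint)
        if key in self.muted:
            return "muted"
        until = self.snoozed_until.get(key)
        if until is not None and now < until:
            return "snoozed"
        if any(start <= now < end for start, end in self.windows):
            return "suppressed"
        return None
```

Because the mute key includes the check name, muting "Disk space" on an endpoint leaves any other check on that endpoint untouched, including a custom check that watches the same volume.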

VIP escalation

If a check fails on an endpoint owned by a VIP contact (see Clients & contacts), the alert's severity is automatically bumped one level — Info becomes Warning, Warning becomes Critical. This is on by default and configurable per tenant.
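The one-level bump can be written as a short function. This is a sketch: the function and parameter names are hypothetical, and it assumes Critical stays Critical, since only the Info and Warning transitions are described above.

```python
SEVERITIES = ["Info", "Warning", "Critical"]  # lowest to highest


def effective_severity(severity: str, owner_is_vip: bool, vip_escalation: bool = True) -> str:
    """Return the severity after VIP escalation.

    `vip_escalation` stands in for the per-tenant setting (on by default).
    """
    if not (owner_is_vip and vip_escalation):
        return severity
    index = SEVERITIES.index(severity)
    # Assumption: there is no level above Critical, so it is left unchanged.
    return SEVERITIES[min(index + 1, len(SEVERITIES) - 1)]
```

If a rule's severity decides who gets notified, remember that a Warning rule can notify as Critical for VIP-owned endpoints.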

Common patterns

"I want to be paged for prod server outages but not for office laptop flakiness"

Tag your prod server endpoints with criticality:prod.
Create an alert rule: scope = criticality:prod, severity = Critical, action = Notify on-call rotation.
Existing alerts on non-prod endpoints continue to behave normally.

"Disk space alerts are noisy for our terminal-server clients (always full, always being managed)"

Identify the affected endpoints.
Either mute the disk-space check on those endpoints, or change the policy threshold for that client to a more permissive value.

"I need to monitor a metric the agent doesn't currently collect"

Two options:

Write a script-based check (the agent runs it, the check uses the script's exit code).
Write a tray-app metric publisher (advanced — talk to support for guidance).

Common issues

Alert keeps firing for something I "fixed". The underlying check is still failing. Alerts auto-clear when the check goes green; if it's still active, the condition is still present. Look at the check's recent values on the agent's page.

A check is firing on the wrong client's endpoints. Almost always a scoping mistake on the alert rule. Open the rule, check its scope (client/tag filter) — it's likely too broad.

I muted a check but it's still alerting. A mute applies to one specific check on one specific endpoint. If you muted "Disk space" on endpoint X but a separate custom check also monitors disk space on that endpoint, the custom check is not muted and its rule will still alert.

Next

Alert templates: the multi-channel notification config (email + Slack + Teams + Discord + webhook + SMS) and the agent/site/client cascade that decides which one fires
Workflows: automate the response side. When a check fails, run a script, create a ticket, or fire a webhook
Scripts & automation: the script library workflows invoke
Tickets: what happens to alerts that become tickets
