
Monitoring & alerts

OpsMerge watches every monitored endpoint continuously and turns failing checks into alerts, which can become tickets. This article covers what's checked, how to add custom checks, and how alerts flow into your PSA.

Out-of-the-box checks

Every agent runs a baseline set of checks without configuration:

| Check | What it does | Default threshold |
| --- | --- | --- |
| CPU load | 1/5/15-minute averages | Alert > 90% sustained for 5 min |
| Memory | Available memory | Alert < 10% available |
| Disk space | Per-volume free space | Alert < 10% free, or < 5 GB if volume > 100 GB |
| Disk I/O | Read/write latency on each disk | Alert > 100 ms average for 5 min |
| Network | Per-interface up/down state | Alert on transition |
| Service status (Windows) | Core OS + agent services | Alert if Stopped when set to Auto |
| systemd unit status (Linux) | Agent + standard system units | Alert if failed or inactive when expected active |
| Heartbeat | Agent connectivity to OpsMerge | Alert if no heartbeat for > 5 min |
| Updates available (Windows) | Pending Windows Updates | Information-only (no alert), unless > 30 days outstanding |
| Updates available (Linux) | apt/dnf available updates | Information-only |
| Pending reboot | Reboot required after update | Information-only |

These defaults are tuned for typical office endpoints. For servers and high-criticality machines, you'll want tighter thresholds — see Policies for how to override.
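
The "sustained for 5 min" thresholds in the table can be read as: alert only when every sample in the trailing five minutes breaches the limit. The following is a minimal sketch of that reading, not OpsMerge's actual evaluation logic. The function name, the sample format, and the one-sample-per-minute cadence are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def sustained_breach(samples, limit, window=timedelta(minutes=5)):
    """Return True if every sample in the trailing window exceeds `limit`
    and the samples cover the whole window.

    `samples` is a list of (timestamp, value) tuples sorted oldest first.
    """
    if not samples:
        return False
    newest = samples[-1][0]
    cutoff = newest - window
    # Not enough history yet to say the breach lasted the full window.
    if samples[0][0] > cutoff:
        return False
    recent = [value for ts, value in samples if ts >= cutoff]
    return all(value > limit for value in recent)


start = datetime(2024, 1, 1, 9, 0)
# Seven hypothetical CPU samples, one per minute, all above 90%.
busy = [(start + timedelta(minutes=i), 95.0) for i in range(7)]
# The same series with a single dip below the limit at minute 4.
spiky = [(ts, 40.0 if i == 4 else v) for i, (ts, v) in enumerate(busy)]

print(sustained_breach(busy, limit=90.0))   # True
print(sustained_breach(spiky, limit=90.0))  # False
```

Requiring the whole window to breach is what keeps a short spike from raising an alert.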

Adding custom checks

Settings → RMM → Checks → + New check.

Three flavours of custom check:

1. Threshold check on a metric the agent already collects

E.g. "alert if a specific Windows service is stopped" or "alert if CPU temperature > 85°C".

Pick the metric, set the comparison, set the threshold, set the duration. Done.

2. Script-based check

The agent runs a script on a schedule, and the script's exit code determines pass/fail:

  • Exit 0 = check passed.
  • Exit non-zero = check failed; the script's stdout is the alert message.

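Example: a PowerShell script that fails the check when an application's error log has grown past 100 lines.

```powershell
# Count the lines currently in the application's error log.
$ErrorCount = (Get-Content C:\App\error.log | Measure-Object -Line).Lines
if ($ErrorCount -gt 100) {
    # stdout becomes the alert message; a non-zero exit fails the check.
    Write-Output "Error log has $ErrorCount entries; investigate."
    exit 1
}
# Exit 0 tells the agent the check passed.
exit 0
```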
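
To make the exit-code contract concrete, here is a rough sketch of how a runner could turn a script's result into pass/fail plus an alert message. It is not the agent's real code: `run_script_check` is a hypothetical name, and the 60-second timeout is an assumed value.

```python
import subprocess
import sys


def run_script_check(command, timeout=60):
    """Run a check script and map its exit code to a pass/fail result.

    Returns a (passed, message) tuple. The message is the script's stdout,
    which is what becomes the alert text when the check fails.
    """
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    passed = completed.returncode == 0
    return passed, completed.stdout.strip()


# Two stand-in "check scripts", run with the current Python interpreter.
ok_script = [sys.executable, "-c", "raise SystemExit(0)"]
bad_script = [
    sys.executable,
    "-c",
    "print('Error log has 140 entries'); raise SystemExit(1)",
]

print(run_script_check(ok_script))   # (True, '')
print(run_script_check(bad_script))  # (False, 'Error log has 140 entries')
```

When writing your own script, print the message you want to see in the alert before exiting non-zero, since stdout is all the alert carries.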

3. External check (no agent involvement)

OpsMerge itself reaches out and tests something, with no agent involved. This is useful for domain monitoring, and also for questions like "is this server's external HTTPS endpoint responding?". See Domain monitoring for the external check infrastructure.
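
As an illustration of what an "is the HTTPS endpoint responding?" probe involves, the sketch below makes one GET request and reports pass or fail. It uses only Python's standard library and is not how OpsMerge implements external checks. The function name and the 10-second timeout are assumptions.

```python
import urllib.error
import urllib.request


def probe_endpoint(url, timeout=10):
    """Return (passed, message) for a single GET request against `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, TLS error, or timeout.
        return False, f"unreachable: {exc}"
```

Calling `probe_endpoint("https://example.com/health")` (a placeholder URL) returns a tuple in the same pass/fail-plus-message shape as a script-based check.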

Alert rules

A check is a thing that can pass or fail. An alert rule decides what to do when a check fails.

Settings → RMM → Alert rules → + New rule.

You set:

  • Which check(s) trigger the rule.
  • Scope: all endpoints, specific client, specific tag, etc.
  • Severity: Info / Warning / Critical.
  • Action: notify, create a ticket, run a script, or any combination of these.

Actions:

| Action | Behaviour |
| --- | --- |
| Notify | Sends the alert to a team member's email/dashboard; no ticket is created. Useful for "I want to know but it's not urgent". |
| Create ticket | Creates an OpsMerge ticket. Priority and assignment are set per rule. |
| Run script | Runs a script against the affected endpoint (with safety guards). |
| Acknowledge automatically | Auto-clears the alert when the check goes green again (this is the default behaviour anyway). |

Multiple actions can fire from one rule. The most common pattern is "Notify + Create ticket".
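
One way to picture how a rule's check list, scope, severity, and actions fit together is the data-structure sketch below. The class and field names are hypothetical and do not reflect OpsMerge's internal schema. Scope is simplified to an optional client and an optional tag.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    name: str
    client: str
    tags: set = field(default_factory=set)


@dataclass
class AlertRule:
    checks: set                   # check names that trigger this rule
    severity: str                 # "Info", "Warning" or "Critical"
    actions: list                 # e.g. ["notify", "create_ticket"]
    client: Optional[str] = None  # None means any client
    tag: Optional[str] = None     # None means any tag

    def matches(self, check, endpoint):
        if check not in self.checks:
            return False
        if self.client is not None and endpoint.client != self.client:
            return False
        if self.tag is not None and self.tag not in endpoint.tags:
            return False
        return True


def actions_for(rules, check, endpoint):
    """Collect (severity, action) pairs from every rule that matches."""
    return [
        (rule.severity, action)
        for rule in rules
        if rule.matches(check, endpoint)
        for action in rule.actions
    ]


rules = [
    AlertRule(
        checks={"Heartbeat"},
        severity="Critical",
        actions=["notify", "create_ticket"],
        tag="criticality:prod",
    )
]
server = Endpoint("db01", "Acme", {"criticality:prod"})
laptop = Endpoint("lt-17", "Acme")

print(actions_for(rules, "Heartbeat", server))
# [('Critical', 'notify'), ('Critical', 'create_ticket')]
print(actions_for(rules, "Heartbeat", laptop))
# []
```

A rule with no client or tag filter matches every endpoint, which is the usual cause of alerts landing on the wrong client's machines.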

Alert lifecycle

Check fails ──> Alert raised ──> (optional) Ticket created
                                       │
                                       │  while alert is active
                                       │
Check passes again ──> Alert cleared ──┴──> Ticket linked to alert auto-updates

  • An alert is the live state. While the check is failing, the alert is "active".
  • A ticket is the PSA-side record. Tickets created from alerts are linked back to the alert.
  • When the check goes green, the alert auto-clears. The linked ticket may auto-close (configurable) or just get an update comment ("alert resolved").

If a check fails, passes, then fails again, the original alert is reused, so you don't get a swarm of duplicate alerts for a flapping condition. The alert just goes active/cleared/active.
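
The reuse behaviour can be sketched as a small state tracker that keeps one alert per check and endpoint and flips its state, instead of creating a new alert on each failure. This is an illustrative model only. The class name, the dictionary fields, and the `raised` counter are assumptions.

```python
class AlertTracker:
    """Keeps one alert per (check, endpoint) pair and toggles its state."""

    def __init__(self):
        self.alerts = {}  # (check, endpoint) -> alert dict

    def record(self, check, endpoint, passed):
        key = (check, endpoint)
        alert = self.alerts.get(key)
        if not passed:
            if alert is None:
                # First failure: raise a new alert.
                alert = {"id": len(self.alerts) + 1, "state": "active", "raised": 1}
                self.alerts[key] = alert
            elif alert["state"] == "cleared":
                # Failing again: reactivate the existing alert.
                alert["state"] = "active"
                alert["raised"] += 1
        elif alert is not None and alert["state"] == "active":
            # Check went green: clear the alert but keep the record.
            alert["state"] = "cleared"
        return alert


tracker = AlertTracker()
for passed in [False, True, False]:  # fail, pass, fail again
    alert = tracker.record("Disk space", "srv01", passed)

print(len(tracker.alerts))  # 1
print(alert)                # {'id': 1, 'state': 'active', 'raised': 2}
```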

Mute, snooze, suppress

| Action | Effect | When to use |
| --- | --- | --- |
| Mute check on endpoint | Specific check stops firing for this endpoint. | The endpoint is intentionally in a weird state and shouldn't alert. |
| Snooze alert | Active alert is hidden for X hours/days. | You're aware, you'll deal with it, stop bothering you. |
| Suppress during window | Alerts during specific times (e.g. maintenance windows) are auto-acknowledged. | Scheduled work where alerts would be noise. |

Mutes and suppressions are visible in the Suppressed alerts view so they don't get silently forgotten.
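
As a rough model of two of these controls, the sketch below treats a mute as a (check, endpoint) pair and a suppression window as a daily time range that may cross midnight. The function names and data shapes are hypothetical, and snooze is left out for brevity.

```python
from datetime import datetime, time


def is_muted(mutes, check, endpoint):
    """`mutes` is a set of (check, endpoint) pairs."""
    return (check, endpoint) in mutes


def in_window(moment, start, end):
    """True if `moment`'s time of day falls inside [start, end).

    Handles windows that cross midnight, e.g. 22:00 to 02:00.
    """
    now = moment.time()
    if start <= end:
        return start <= now < end
    return now >= start or now < end


mutes = {("Disk space", "ts01")}
print(is_muted(mutes, "Disk space", "ts01"))  # True
print(is_muted(mutes, "Disk space", "ts02"))  # False

# A hypothetical nightly maintenance window from 22:00 to 02:00.
print(in_window(datetime(2024, 1, 1, 23, 30), time(22, 0), time(2, 0)))  # True
print(in_window(datetime(2024, 1, 1, 12, 0), time(22, 0), time(2, 0)))   # False
```

Keying mutes on the pair is why muting a check on one endpoint has no effect on other endpoints or on other checks.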

VIP escalation

If a check fails on an endpoint owned by a VIP contact (see Clients & contacts), the alert's severity is automatically bumped one level — Info becomes Warning, Warning becomes Critical. This is on by default and configurable per tenant.
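
The one-level bump can be expressed as a lookup into an ordered severity list. In this sketch the function name and the `enabled` flag (standing in for the per-tenant setting) are hypothetical. Leaving Critical unchanged is an assumption, since the article only describes what happens to Info and Warning.

```python
SEVERITIES = ["Info", "Warning", "Critical"]


def escalate_for_vip(severity, owner_is_vip, enabled=True):
    """Bump severity one level for VIP-owned endpoints.

    Critical is already the top level, so this sketch leaves it unchanged
    (an assumption; the article only describes Info and Warning).
    """
    if not (enabled and owner_is_vip):
        return severity
    index = SEVERITIES.index(severity)
    return SEVERITIES[min(index + 1, len(SEVERITIES) - 1)]


print(escalate_for_vip("Info", owner_is_vip=True))      # Warning
print(escalate_for_vip("Warning", owner_is_vip=True))   # Critical
print(escalate_for_vip("Warning", owner_is_vip=False))  # Warning
```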

Common patterns

"I want to be paged for prod server outages but not for office laptop flakiness"

  1. Tag your prod server endpoints with criticality:prod.
  2. Create an alert rule: scope = criticality:prod, severity = Critical, action = Notify on-call rotation.
  3. Existing alerts on non-prod endpoints continue to behave normally.

"Disk space alerts are noisy for our terminal-server clients (always full, always being managed)"

  1. Identify the affected endpoints.
  2. Either mute the disk-space check on those endpoints, or change the policy threshold for that client to a more permissive value.

"I need to monitor a metric the agent doesn't currently collect"

Two options:

  • Write a script-based check (the agent runs it, the check uses the script's exit code).
  • Write a tray-app metric publisher (advanced — talk to support for guidance).

Common issues

Alert keeps firing for something I "fixed". The underlying check is still failing. Alerts auto-clear when the check goes green; if it's still active, the condition is still present. Look at the check's recent values on the agent's page.

A check is firing on the wrong client's endpoints. Almost always a scoping mistake on the alert rule. Open the rule, check its scope (client/tag filter) — it's likely too broad.

I muted a check but it's still alerting. A mute applies to one specific check + endpoint combination. If you muted "Disk space" on endpoint X but a separate rule alerts on disk space through a custom check, that custom check isn't muted and will keep alerting.

Next

  • Alert templates — the multi-channel notification config (email + Slack + Teams + Discord + webhook + SMS) and the agent/site/client cascade that decides which one fires
  • Workflows — automate the response side: when a check fails, run a script, create a ticket, or fire a webhook
  • Scripts & automation — the script library workflows invoke
  • Tickets — what happens to alerts that become tickets

OpsMerge is a product of Brindleford Technologies Ltd, company number 16871436, registered in England and Wales.