Status & incidents

Where to check

Live service status: status.opsmerge.cloud.

The status page is hosted independently of OpsMerge itself — if OpsMerge is down, the status page is still up. It shows:

Current status of each component: app, API, agent ingest, email gateway, integrations.
Ongoing incidents with rolling updates.
Recently-resolved incidents with full post-mortems.
Scheduled maintenance windows, usually announced at least 72 hours ahead.

Subscribe to updates via email, RSS, or Slack at the top of the status page.

What counts as an incident

We declare an incident when:

Any user-facing feature is broken for a meaningful slice of users (not just one tenant's misconfiguration).
Agent ingest is degraded — agents can't register or reconnect.
A scheduled job is failing — recurring invoices, M365 sync, retention pruning, etc.
Data integrity is at risk. This is extremely rare, and we treat it as P1.

What's not an incident:

Slow API responses within our published SLA.
A specific client's email gateway misroute (that's a config issue with their domain or DMARC, not a platform incident).
A specific agent failing to install on one endpoint.

If you see something that looks like an incident but the status page is green, tell us. The status page is updated by hand, and updates sometimes lag behind the start of an issue.

Maintenance windows

Planned maintenance happens:

Twice a month for routine database/infrastructure updates.
Quarterly for larger upgrades (e.g. Postgres major version, Kubernetes node refresh).
Ad-hoc for security patches (sometimes urgent, sometimes scheduled).

Routine maintenance is announced at least 72 hours ahead on the status page and by email to status page subscribers. Ad-hoc security patches may be announced at shorter notice, but we'll always explain why.

Most maintenance is zero-downtime, though you might see a brief blip while agents reconnect. If a window is expected to cause downtime, the announcement says so explicitly.

What to do during an incident

Check the status page first. If it's already declared, we know.
Don't re-deploy your agents en masse. If agent ingest is degraded, mass reinstalls just amplify the load and slow recovery.
Note the timestamp of anything that fails for you. We may ask for these later to correlate with our own logs.
If the status page shows green but you're sure something's broken, contact support — describe what you're seeing.

Don't paper-cut your own data

A common bad reaction to an ongoing incident is to "fix it locally" by manually editing data, force-running scripts, or repeatedly re-issuing API calls. Sit on your hands — once the incident is resolved, OpsMerge usually self-heals (queued operations run, retries succeed). Pre-emptive local fixes often create more cleanup work than they save.
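If you run automation that calls the API, one way to avoid hammering a degraded service is capped exponential backoff with jitter. The sketch below is a generic pattern, not a documented OpsMerge retry policy; the function name and the default base, cap, and attempt count are illustrative assumptions.

```python
import random


def backoff_delays(max_attempts, base_seconds=1.0, cap_seconds=60.0, rng=None):
    """Yield one wait time (in seconds) per retry attempt.

    Uses capped exponential backoff with full jitter: the ceiling doubles
    on every attempt up to cap_seconds, and the actual delay is drawn
    uniformly between 0 and that ceiling so clients do not retry in lockstep.
    All defaults are illustrative assumptions.
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        yield rng.uniform(0, ceiling)
```

A caller would sleep for each yielded delay between attempts and, once the attempts are exhausted, stop and wait for the incident to be resolved rather than looping forever.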

Post-mortems

For any incident lasting more than 30 minutes or with material data impact, we publish a post-mortem within 5 working days. The post-mortem includes:

A clear timeline of what happened.
Root cause (technical, naming the components involved, not a vague "an issue with our infrastructure").
What we did to resolve.
What we're changing to prevent recurrence.

Post-mortems live on the status page under the resolved incident. They're public — anyone can read them, including your clients.

We do blameless post-mortems internally and externally. We don't write "engineer X failed to..."; we write "the deploy process didn't catch X" and fix the process. This is standard SRE practice and it's how we keep improving.
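The publication rule at the start of this section (longer than 30 minutes or material data impact, published within 5 working days) can be sketched as follows. This is a minimal illustration with assumptions: working days are taken to be Monday to Friday with no public holidays, and the 5 days are counted from the day the incident is resolved. Neither detail is stated above, and the function names are hypothetical.

```python
from datetime import date, timedelta

POSTMORTEM_MIN_MINUTES = 30   # "lasting more than 30 minutes"
POSTMORTEM_WORKING_DAYS = 5   # "within 5 working days"


def needs_postmortem(duration_minutes, material_data_impact):
    """Return True if the incident meets either publication criterion."""
    return duration_minutes > POSTMORTEM_MIN_MINUTES or material_data_impact


def add_working_days(start, days):
    """Return the date `days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def postmortem_due(resolved_on):
    """Latest publication date, assuming the clock starts at resolution."""
    return add_working_days(resolved_on, POSTMORTEM_WORKING_DAYS)
```

Under these assumptions, an incident resolved on Wednesday 6 March 2024 would have its post-mortem due by Wednesday 13 March 2024.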

Service level objectives

We target:

99.9% availability of the app and API (counted by 5-minute intervals across each calendar month).
99.9% agent ingest availability.
P50 < 200ms / P99 < 1s API response times for read endpoints, with caveats for slow analytical queries (which we surface explicitly).
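To make the 99.9% target concrete, the sketch below works out how many 5-minute intervals in a calendar month can be unavailable before the target is missed. It assumes each interval is classified as either available or unavailable as a whole; how an individual interval is classified is not specified here, and the function names are illustrative.

```python
import calendar

INTERVAL_MINUTES = 5       # interval length from the SLO definition
TARGET_PER_MILLE = 999     # 99.9% expressed in parts per thousand


def intervals_in_month(year, month):
    """Number of 5-minute intervals in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60 // INTERVAL_MINUTES


def allowed_bad_intervals(year, month):
    """Largest number of unavailable intervals that still meets the target."""
    total = intervals_in_month(year, month)
    return total * (1000 - TARGET_PER_MILLE) // 1000


def meets_slo(total, bad):
    """Integer comparison avoids floating-point rounding at the boundary."""
    return (total - bad) * 1000 >= TARGET_PER_MILLE * total
```

For a 30-day month such as April 2024 there are 8,640 intervals, so up to 8 of them (40 minutes in total) can be unavailable while still meeting 99.9%; a ninth would miss it.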

We're transparent when we miss. If we miss an SLO in a given month, the next month's status page summary calls it out. We don't currently offer service credits: our relationship with customers is more "honest about misses, fix the cause" than "commercial penalty".

Incident communications

For active incidents:

Status page is updated every 15 minutes or on material change, whichever is sooner.
Email subscribers get an alert when an incident is declared and another when it's resolved.
Founder Discord gets ad-hoc updates from the team — usually more detail than the status page, because we're talking to people who can handle technical context.

For very serious incidents (multi-hour outage, data risk), we email all customers directly, not just status-page subscribers.

Reporting an outage to us

If the status page is green but you believe there's an outage, contact support with:

What's broken — the specific feature/page/integration.
Where you're connecting from — country, ISP if you know it.
When it started — even an approximate time helps us narrow correlation.
Any error messages — even partial.

We'd rather have a false positive than a missed real outage. Tell us if you're suspicious.
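If you want to collect the four details above before contacting support, a small helper like the one below keeps them together. It is only a sketch: the class, its field names, and the output layout are hypothetical, and support does not require any particular format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class OutageReport:
    what_is_broken: str        # the specific feature, page, or integration
    connecting_from: str       # country, plus ISP if known
    started_at: datetime       # approximate is fine; use a timezone-aware value
    error_messages: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        """Render the report as plain text, with the start time in UTC."""
        started = self.started_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        errors = "\n".join(f"  - {e}" for e in self.error_messages) or "  (none captured)"
        return (
            f"What's broken: {self.what_is_broken}\n"
            f"Connecting from: {self.connecting_from}\n"
            f"Started (approx.): {started}\n"
            f"Error messages:\n{errors}"
        )
```

Normalizing the start time to UTC is a design choice here so that a timestamp from any time zone can be compared directly with others.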
