Updated March 2026

Server Monitoring — Healthcare & Allied Health Edition

A procedure for continuously monitoring the health, performance, and availability of servers and taking appropriate action when alerts are triggered.

Purpose

To proactively detect and address server issues before they impact business operations, maintaining optimal uptime, performance, and reliability of server infrastructure.

Scope

Covers all physical and virtual servers, including application servers, database servers, file servers, and web servers, whether hosted on-premises or in cloud environments.

Prerequisites

Server monitoring tools installed and configured for all managed servers
Defined alert thresholds and escalation paths
Access to server management interfaces and dashboards

Compliance Note

Includes safeguards for the Australian Privacy Principles (APPs) under the Privacy Act 1988, Medicare compliance, and health record management under the My Health Records Act 2012. All patient data handling follows AHPRA guidelines.

Step-by-Step Procedure

Review Monitoring Dashboard

Check the server monitoring dashboard at the start of each shift to assess the overall health and status of all monitored servers.

1.1 Review the monitoring dashboard for any active alerts or warnings
1.2 Check CPU, memory, disk, and network utilisation across all servers
1.3 Note any servers showing degraded performance or elevated resource usage

Systems Operations Analyst

10 minutes

Server Monitoring Platform, Dashboard
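
The dashboard check in this step can be sketched as a simple triage over a snapshot of server metrics. This is a minimal illustration, not the output of any particular monitoring platform: the `ServerSnapshot` type, the server names, and the 80% and 95% thresholds are all assumptions, and network utilisation is left out for brevity.

```python
from dataclasses import dataclass

# Hypothetical thresholds (percent utilisation); real values come from your alert configuration.
WARNING_PCT = 80.0
CRITICAL_PCT = 95.0


@dataclass
class ServerSnapshot:
    name: str
    cpu_pct: float
    memory_pct: float
    disk_pct: float


def classify(snapshot: ServerSnapshot) -> str:
    """Return 'critical', 'warning', or 'ok' based on the highest utilisation metric."""
    peak = max(snapshot.cpu_pct, snapshot.memory_pct, snapshot.disk_pct)
    if peak >= CRITICAL_PCT:
        return "critical"
    if peak >= WARNING_PCT:
        return "warning"
    return "ok"


def servers_needing_attention(snapshots):
    """List (name, status) for every server that is not 'ok', critical servers first."""
    flagged = [(s.name, classify(s)) for s in snapshots if classify(s) != "ok"]
    # False sorts before True, so critical entries move to the front.
    return sorted(flagged, key=lambda item: item[1] != "critical")
```

Calling `servers_needing_attention` on the start-of-shift snapshot gives the analyst a short list to note under step 1.3, with the most urgent servers at the top.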

Investigate Alerts

When an alert is triggered, investigate the cause by examining server logs, resource metrics, and recent changes.

2.1 Identify the alert type and affected server
2.2 Check server event logs and application logs for errors
2.3 Review resource utilisation trends to identify the triggering condition

Systems Operations Analyst

15 minutes

Server Monitoring Platform, Log Analysis Tools
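
Where a dedicated log analysis tool is not to hand, the log check in step 2.2 can be approximated with a short script. The sketch below assumes each log line has the shape "timestamp LEVEL message", which is an assumption about your log format rather than a standard; adjust the pattern to match your servers.

```python
import re
from collections import Counter

# Assumed log line shape: "<timestamp> <LEVEL> <message>"; adjust the pattern to your log format.
LINE_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def summarise_errors(lines, levels=("ERROR", "CRITICAL")):
    """Count messages at the given levels so that repeated errors stand out."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") in levels:
            counts[match.group("msg")] += 1
    # Most frequent messages first.
    return counts.most_common()
```

Grouping by message text is a deliberately crude choice: it surfaces the error that repeats most often around the time of the alert, which is usually the first thing to compare against recent changes.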

Take Remedial Action

Implement the appropriate response to resolve the alert, which may range from clearing a log file to restarting a service or escalating to engineering.

3.1 Apply the standard response procedure for the alert type
3.2 For common issues, execute the documented remediation steps
3.3 For complex or unknown issues, escalate to the infrastructure engineering team

Systems Operations Analyst

15 minutes to 2 hours

Server Management Tools, Runbook

Tips

Follow the runbook procedures before attempting ad hoc fixes
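
The decision between documented remediation and escalation can be expressed as a lookup against the runbook. The alert type names and runbook references below (`disk_space_low`, `RB-01`, and so on) are hypothetical placeholders, not entries from any real runbook.

```python
# Hypothetical alert types and runbook references; replace with entries from your own runbook.
RUNBOOK = {
    "disk_space_low": "RB-01: clear rotated log files and temporary data",
    "service_stopped": "RB-02: restart the service and confirm it stays up",
}

ESCALATION_TARGET = "infrastructure engineering team"


def choose_response(alert_type: str) -> dict:
    """Return the documented remediation for a known alert type, otherwise escalate."""
    procedure = RUNBOOK.get(alert_type)
    if procedure is not None:
        return {"action": "remediate", "detail": procedure}
    # Unknown or complex alerts are never handled ad hoc; they go to engineering.
    return {"action": "escalate", "detail": ESCALATION_TARGET}
```

Keeping the mapping in one place means an alert type without a runbook entry always escalates, which matches the tip above about avoiding ad hoc fixes.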

Verify Resolution

After taking action, verify that the alert has cleared and the server has returned to normal operating parameters.

4.1 Confirm the alert has cleared on the monitoring dashboard
4.2 Verify server metrics have returned to normal thresholds
4.3 Check that business applications hosted on the server are functioning correctly

Systems Operations Analyst

10 minutes

Server Monitoring Platform
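
A single good reading after a fix can be misleading, so one way to verify step 4.2 is to require several consecutive samples below the threshold. The sketch below assumes metric samples are available as a list ordered oldest to newest; the default of three samples is an illustrative choice, not a recommendation from this procedure.

```python
def is_resolved(samples, threshold, required_consecutive=3):
    """True if the most recent `required_consecutive` samples are all below `threshold`."""
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(samples) < required_consecutive:
        # Not enough data yet to call the alert resolved.
        return False
    recent = samples[-required_consecutive:]
    return all(value < threshold for value in recent)
```

For example, CPU samples of 96, 91, 70, 65, 62 against a threshold of 80 count as resolved, while 96, 70, 85, 60, 58 do not, because the reading of 85 still falls within the last three samples.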

Generate Performance Reports

Produce regular server performance and availability reports for management review and capacity planning.

5.1 Generate weekly and monthly server performance reports
5.2 Highlight any servers that are approaching capacity thresholds
5.3 Provide recommendations for capacity expansion or optimisation

Systems Operations Analyst

1 hour weekly

Reporting Tools, Server Monitoring Platform
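
To highlight servers approaching capacity (step 5.2), a report can project when a resource will fill based on recent growth. The sketch below uses a straight-line estimate from the first and last daily readings, which is a simplifying assumption; the 30-day horizon and the input shape (a mapping of server name to daily disk usage percentages) are also hypothetical.

```python
def days_until_full(daily_used_pct, capacity_pct=100.0):
    """Estimate days until capacity from average daily growth; None if usage is flat or falling."""
    if len(daily_used_pct) < 2:
        return None
    growth_per_day = (daily_used_pct[-1] - daily_used_pct[0]) / (len(daily_used_pct) - 1)
    if growth_per_day <= 0:
        return None
    return (capacity_pct - daily_used_pct[-1]) / growth_per_day


def approaching_capacity(disk_history, horizon_days=30):
    """Return {server: estimated_days} for servers projected to fill within the horizon."""
    report = {}
    for server, history in disk_history.items():
        estimate = days_until_full(history)
        if estimate is not None and estimate <= horizon_days:
            report[server] = round(estimate, 1)
    return report
```

A linear projection will be wrong for bursty workloads, so treat the result as a prompt for the recommendations in step 5.3 rather than a forecast.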

Update Monitoring Configuration

Periodically review and update monitoring thresholds, alert rules, and the list of monitored servers so that coverage stays current and alerts remain meaningful.

6.1 Review alert thresholds for appropriateness based on recent trends
6.2 Add new servers to the monitoring platform when deployed
6.3 Remove decommissioned servers from the monitoring configuration

Infrastructure Engineer

30 minutes monthly

Server Monitoring Platform
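
Steps 6.2 and 6.3 amount to reconciling the server inventory against the monitored list. The sketch below assumes both can be exported as plain lists of server names; the function name and the example names in the note are hypothetical.

```python
def reconcile(inventory, monitored, decommissioned):
    """Compare the server inventory with the monitored list.

    Returns (to_add, to_remove): servers deployed but unmonitored, and
    monitored servers that are decommissioned or no longer in the inventory.
    """
    active = set(inventory) - set(decommissioned)
    to_add = sorted(active - set(monitored))
    to_remove = sorted(set(monitored) - active)
    return to_add, to_remove
```

With an inventory of app-01, db-01, web-02, and file-01, a monitored list of app-01, db-01, and web-01, and file-01 marked as decommissioned, the result is web-02 to add and web-01 to remove.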

Quality Checkpoints

All production servers are included in the monitoring platform

Alert thresholds are reviewed and validated at least quarterly

All alerts are investigated and documented within the defined response time

Common Mistakes to Avoid

Ignoring warning alerts that have not yet reached critical levels, missing early intervention opportunities

Not updating monitoring when new servers are deployed, leaving them unmonitored

Creating too many alert notifications, leading to alert fatigue and missed critical alerts

Not following documented runbook procedures, leading to inconsistent responses

Expected Outcomes

Server Uptime

Percentage of time servers are available and operational, measuring infrastructure reliability.
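
As a worked example, uptime can be computed from the length of the reporting period and the total downtime within it. Measuring both in minutes is an assumption made here for illustration; whether planned maintenance counts as downtime is a policy decision this sketch does not make.

```python
def uptime_percent(period_minutes, downtime_minutes):
    """Percentage of the reporting period during which the server was available."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes
```

For a 30-day month (43,200 minutes), 43.2 minutes of downtime works out to roughly 99.9% uptime.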

Alert Response Time

Average time from alert trigger to investigation start, measuring monitoring responsiveness.
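
This metric can be calculated from two timestamps per alert. The sketch below assumes each alert record provides the trigger time and the time investigation began as `datetime` values; how those are exported from your monitoring platform will vary.

```python
from datetime import datetime
from statistics import mean


def average_response_minutes(alerts):
    """Mean minutes between alert trigger and investigation start.

    `alerts` is an iterable of (triggered_at, investigation_started_at) datetime pairs.
    Returns None when there are no alerts in the period.
    """
    durations = [
        (started - triggered).total_seconds() / 60
        for triggered, started in alerts
    ]
    if not durations:
        return None
    return mean(durations)
```

An average hides outliers, so it may be worth reporting the longest response time alongside it.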

Frequently Asked Questions

How are monitoring thresholds set?

Thresholds are set based on vendor recommendations, historical performance data, and business requirements. They should be tuned to avoid both false positives and missed genuine issues.
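
One way to use historical performance data is to place the warning threshold at a high percentile of past utilisation, so that only unusual readings raise an alert. The sketch below is an illustration under stated assumptions, not a vendor recommendation: the 95th percentile default and the 90% ceiling (a cap that keeps some headroom on servers that routinely run hot) are both hypothetical choices.

```python
from statistics import quantiles


def suggest_warning_threshold(history, percentile=95, ceiling=90.0):
    """Suggest a warning threshold at a percentile of historical utilisation, capped at `ceiling`."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # n=100 yields 99 cut points; index percentile - 1 is the requested percentile.
    cut_points = quantiles(history, n=100, method="inclusive")
    return min(cut_points[percentile - 1], ceiling)
```

A suggested value should still be checked against vendor guidance and business requirements before it replaces an existing threshold.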

What metrics are typically monitored on servers?

Key metrics include CPU utilisation, memory usage, disk space and input/output, network traffic, service status, event log errors, and application-specific performance counters.

What is the difference between a warning and a critical alert?

A warning alert indicates a metric is approaching a problematic threshold and requires attention. A critical alert indicates a threshold has been exceeded or a service has failed and requires immediate action.
