Server Monitoring — Healthcare & Allied Health Edition
A procedure for continuously monitoring the health, performance, and availability of servers and taking appropriate action when alerts are triggered.
Purpose
To proactively detect and address server issues before they impact business operations, maintaining optimal uptime, performance, and reliability of server infrastructure.
Scope
Covers all physical and virtual servers, including application servers, database servers, file servers, and web servers, whether hosted on-premises or in cloud environments.
Prerequisites
- Server monitoring tools installed and configured for all managed servers
- Defined alert thresholds and escalation paths
- Access to server management interfaces and dashboards
This edition includes safeguards for the Australian Privacy Principles (APPs), Medicare compliance, and health record management under the My Health Records Act. All patient data handling follows AHPRA guidelines.
Step-by-Step Procedure
Review Monitoring Dashboard
Check the server monitoring dashboard at the start of each shift to assess the overall health and status of all monitored servers.
- 1.1 Review the monitoring dashboard for any active alerts or warnings
- 1.2 Check CPU, memory, disk, and network utilisation across all servers
- 1.3 Note any servers showing degraded performance or elevated resource usage
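The utilisation check in steps 1.2 and 1.3 can be sketched as a simple filter over per-server readings. This is a minimal illustration, not the behaviour of any particular monitoring tool: the `ServerSnapshot` type, the `elevated_servers` function, and the 80 percent cut-off are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cut-off for "elevated" usage, in percent; tune per environment.
ELEVATED_PERCENT = 80.0


@dataclass
class ServerSnapshot:
    """One reading per server, all values in percent utilisation."""
    name: str
    cpu: float
    memory: float
    disk: float
    network: float


def elevated_servers(snapshots, limit=ELEVATED_PERCENT):
    """Return {server name: [metrics at or above limit]} for servers worth noting."""
    flagged = {}
    for snap in snapshots:
        hot = [m for m in ("cpu", "memory", "disk", "network")
               if getattr(snap, m) >= limit]
        if hot:
            flagged[snap.name] = hot
    return flagged
```

Passing the shift's snapshots to `elevated_servers` yields only the servers that need a note in the handover, together with the metrics that triggered the flag.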
Investigate Alerts
When an alert is triggered, investigate the cause by examining server logs, resource metrics, and recent changes.
- 2.1 Identify the alert type and affected server
- 2.2 Check server event logs and application logs for errors
- 2.3 Review resource utilisation trends to identify the triggering condition
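For step 2.2, a short script can group repeated error messages so the most frequent ones surface first. The sketch below assumes log lines shaped like `<timestamp> <LEVEL> <message>`; that format, the level names, and the function name `summarise_errors` are illustrative assumptions and will need adjusting to the logs actually in use.

```python
import re
from collections import Counter

# Assumed line shape: "<timestamp> <LEVEL> <message>". Adjust to the real format.
LINE_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def summarise_errors(lines, levels=("ERROR", "CRITICAL")):
    """Count messages at the given levels, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") in levels:
            counts[match.group("msg")] += 1
    return counts.most_common()
```

Lines that do not match the assumed shape are skipped rather than raising, so a partly malformed log still produces a summary.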
Take Remedial Action
Implement the appropriate response to resolve the alert, which may range from clearing a log file to restarting a service or escalating to engineering.
- 3.1 Apply the standard response procedure for the alert type
- 3.2 For common issues, execute the documented remediation steps
- 3.3 For complex or unknown issues, escalate to the infrastructure engineering team
- Follow the runbook procedures before attempting ad hoc fixes
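The decision in steps 3.1 to 3.3 amounts to a lookup: a known alert type maps to documented steps, and anything else is escalated. The sketch below shows that shape only. The alert type names and runbook steps are hypothetical placeholders, not entries from a real runbook.

```python
# Hypothetical alert types and runbook steps, for illustration only.
RUNBOOKS = {
    "disk_space_low": ["archive old logs", "clear temporary files"],
    "service_stopped": ["restart the service", "check dependent services"],
}


def plan_response(alert_type, runbooks=RUNBOOKS):
    """Return documented steps for a known alert type, otherwise an escalation."""
    steps = runbooks.get(alert_type)
    if steps:
        return {"action": "remediate", "steps": list(steps)}
    # Unknown or complex issue: hand over rather than improvise a fix.
    return {"action": "escalate", "team": "infrastructure engineering"}
```

Keeping the mapping in data rather than in code means a new runbook entry can be added without touching the dispatch logic.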
Verify Resolution
After taking action, verify that the alert has cleared and the server has returned to normal operating parameters.
- 4.1 Confirm the alert has cleared on the monitoring dashboard
- 4.2 Verify server metrics have returned to normal thresholds
- 4.3 Check that business applications hosted on the server are functioning correctly
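The three verification checks can be combined into one pass that reports whatever is still outstanding. This is a sketch under stated assumptions: the function name `verify_resolution` and its inputs (a flag for the alert state, metric and threshold dictionaries, and per-application health results) are invented for the example, and how those values are collected depends on the monitoring platform.

```python
def verify_resolution(alert_active, metrics, thresholds, app_checks):
    """Return a list of outstanding problems; an empty list means the fix held."""
    problems = []
    # 4.1: the alert itself must have cleared.
    if alert_active:
        problems.append("alert still active on dashboard")
    # 4.2: every metric with a defined threshold must be at or below it.
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            problems.append(f"{name} at {value} exceeds {limit}")
    # 4.3: every hosted application must report healthy.
    for app, healthy in app_checks.items():
        if not healthy:
            problems.append(f"application {app} not functioning correctly")
    return problems
```

An empty result is the signal to close the alert. Any entries in the list indicate the issue should stay open or be escalated.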
Generate Performance Reports
Produce regular server performance and availability reports for management review and capacity planning.
- 5.1 Generate weekly and monthly server performance reports
- 5.2 Highlight any servers that are approaching capacity thresholds
- 5.3 Provide recommendations for capacity expansion or optimisation
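One way to decide whether a server is "approaching" a capacity threshold is to fit a straight line to recent weekly usage and estimate when it would cross the limit. The sketch below uses an ordinary least-squares fit. The function name, the 90 percent default, and the choice of a linear trend are assumptions for illustration, and real growth is often not linear.

```python
def weeks_until_threshold(weekly_usage, threshold=90.0):
    """Estimate weeks until usage reaches threshold using a least-squares line.

    Returns None when there is too little data or usage is flat or falling.
    """
    n = len(weekly_usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_usage) / n
    slope_num = sum((x - mean_x) * (y - mean_y)
                    for x, y in enumerate(weekly_usage))
    slope_den = sum((x - mean_x) ** 2 for x in range(n))
    slope = slope_num / slope_den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    latest_fit = intercept + slope * (n - 1)  # fitted value for the latest week
    return max(0.0, (threshold - latest_fit) / slope)
```

A small result suggests the server belongs in the "approaching capacity" section of the report. A result of `None` means the trend gives no reason to expect the threshold to be reached.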
Update Monitoring Configuration
Periodically review and update monitoring thresholds, alert rules, and the list of monitored servers to maintain relevance.
- 6.1 Review alert thresholds for appropriateness based on recent trends
- 6.2 Add new servers to the monitoring platform when deployed
- 6.3 Remove decommissioned servers from the monitoring configuration
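Steps 6.2 and 6.3 come down to comparing two lists: the servers that are deployed and the servers that are monitored. A minimal sketch, assuming both lists can be exported as server names (the function name `reconcile_monitoring` is hypothetical):

```python
def reconcile_monitoring(deployed, monitored):
    """Compare the deployed inventory with the monitored list."""
    deployed, monitored = set(deployed), set(monitored)
    return {
        "add": sorted(deployed - monitored),     # deployed but not yet monitored
        "remove": sorted(monitored - deployed),  # monitored but decommissioned
    }
```

Running this at each periodic review gives a concrete list of changes to make to the monitoring configuration.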
Expected Outcomes
- Server availability: the percentage of time servers are available and operational, which measures infrastructure reliability.
- Alert response time: the average time from alert trigger to investigation start, which measures monitoring responsiveness.
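The two measures above are simple to compute once downtime and alert timestamps are recorded. The sketch below assumes downtime is tracked in minutes and that each alert has a trigger time and an investigation start time. The function names are illustrative.

```python
from datetime import datetime, timedelta


def availability_percent(period_minutes, downtime_minutes):
    """Percentage of the reporting period during which the server was available."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def mean_time_to_investigate(alerts):
    """Average delay between alert trigger and investigation start.

    alerts is a list of (triggered_at, investigation_started_at) datetime pairs.
    Returns a timedelta, or None when there are no alerts.
    """
    if not alerts:
        return None
    total = sum((started - triggered for triggered, started in alerts),
                timedelta())
    return total / len(alerts)
```

For a 30-day month (43,200 minutes), each 43.2 minutes of downtime costs 0.1 percentage points of availability.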
Frequently Asked Questions
How are monitoring thresholds set?
Thresholds are set based on vendor recommendations, historical performance data, and business requirements. They should be tuned to avoid both false positives and missed genuine issues.
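As an illustration of the historical-data input only, one common starting point is to take a high percentile of past readings, so that normal peaks do not trigger alerts. The nearest-rank method, the 95th percentile default, and the function name below are assumptions for the sketch. Vendor guidance and business requirements would still need to be applied on top.

```python
import math


def suggest_threshold(history, percentile=95, ceiling=100.0):
    """Nearest-rank percentile of historical samples, capped at ceiling."""
    ordered = sorted(history)
    if not ordered:
        raise ValueError("history must not be empty")
    # Nearest-rank: the smallest value with at least `percentile` percent
    # of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return min(ordered[rank - 1], ceiling)
```

A suggested value that sits close to the ceiling is a sign that the server is already running hot and that a fixed vendor threshold may be the safer choice.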
What metrics are typically monitored on servers?
Key metrics include CPU utilisation, memory usage, disk space and input/output, network traffic, service status, event log errors, and application-specific performance counters.
What is the difference between a warning and a critical alert?
A warning alert indicates a metric is approaching a problematic threshold and requires attention. A critical alert indicates a threshold has been exceeded or a service has failed and requires immediate action.
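The distinction can be expressed as a two-tier check. This is a minimal sketch with hypothetical names and example thresholds, not the rule syntax of any specific monitoring product.

```python
def classify_alert(value, warning, critical, service_running=True):
    """Return 'critical', 'warning', or 'ok' for a single metric reading."""
    # A failed service is critical regardless of the metric value.
    if not service_running or value >= critical:
        return "critical"
    # Approaching the problematic threshold: needs attention, not yet urgent.
    if value >= warning:
        return "warning"
    return "ok"
```

With a warning threshold of 80 and a critical threshold of 95, a disk at 85 percent produces a warning, while the same disk at 96 percent, or a stopped service at any reading, produces a critical alert.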