System Downtime Response
A procedure for responding to unplanned system outages to restore service as quickly as possible while keeping stakeholders informed.
Purpose
To minimise the business impact of unplanned system downtime through rapid detection, structured response, and clear communication, ensuring systems are restored within defined service levels.
Scope
Covers all unplanned outages of business-critical systems including servers, applications, databases, network services, and cloud platforms.
Prerequisites
- System monitoring and alerting configured for all critical systems
- Defined severity levels and response time targets for different systems
- Escalation paths and on-call rosters for infrastructure and application teams
- Stakeholder communication templates for outage notifications
Step-by-Step Procedure
Detect and Confirm the Outage
Identify the outage through monitoring alerts or user reports and confirm the scope and severity of the downtime.
- 1.1 Receive the alert from the monitoring system or user report
- 1.2 Confirm the outage by checking the affected system directly
- 1.3 Determine the severity level based on the number of users and business processes affected
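Step 1.3 can be sketched as a small classification function. This is a minimal illustration only: the level names (SEV1 to SEV3), the thresholds, and the `OutageImpact` fields are hypothetical and should be replaced with your organisation's own severity definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageImpact:
    """Impact figures gathered while confirming the outage (hypothetical fields)."""
    users_affected: int
    critical_processes_affected: int


def classify_severity(impact: OutageImpact) -> str:
    """Map impact to a severity level using assumed, illustrative thresholds."""
    # Assumption: two or more critical processes, or 500+ users, is the top level.
    if impact.critical_processes_affected >= 2 or impact.users_affected >= 500:
        return "SEV1"
    # Assumption: one critical process, or 50+ users, is the middle level.
    if impact.critical_processes_affected == 1 or impact.users_affected >= 50:
        return "SEV2"
    return "SEV3"


print(classify_severity(OutageImpact(users_affected=120, critical_processes_affected=0)))
```

Keeping the thresholds in one function makes it easier to apply the same severity decision across on-call staff.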
Notify Stakeholders
Send an initial notification to affected users and management, advising them of the outage and the expected investigation timeline.
- 2.1 Send an initial outage notification using the communication template
- 2.2 Notify the IT management team and business stakeholders
- 2.3 Set up a status update schedule for ongoing communication
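The notification steps above might look like the following sketch, which fills a message template and derives a status update schedule. The template wording, the field names, and the 30-minute update interval are assumptions for illustration, not part of this procedure.

```python
from datetime import datetime, timedelta
from string import Template

# Hypothetical template; substitute your organisation's approved wording.
INITIAL_NOTICE = Template(
    "[$severity] Outage: $system\n"
    "Detected: $detected\n"
    "Impact: $impact\n"
    "Next update by: $next_update"
)


def update_schedule(start: datetime, interval_minutes: int, count: int) -> list[datetime]:
    """Return the times of the next `count` status updates after `start`."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(1, count + 1)]


def render_initial_notice(system: str, severity: str, impact: str,
                          detected: datetime, interval_minutes: int = 30) -> str:
    """Fill the template, promising the first update one interval after detection."""
    next_update = update_schedule(detected, interval_minutes, 1)[0]
    return INITIAL_NOTICE.substitute(
        system=system,
        severity=severity,
        impact=impact,
        detected=detected.strftime("%Y-%m-%d %H:%M"),
        next_update=next_update.strftime("%H:%M"),
    )


print(render_initial_notice("Payroll application", "SEV2",
                            "Users cannot log in", datetime(2024, 1, 15, 9, 0)))
```

Stating the next update time in the first message gives stakeholders a fixed expectation even before the cause is known.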
Diagnose the Root Cause
Investigate the cause of the outage by examining system logs, infrastructure status, and recent changes.
- 3.1 Review system and application logs for error messages
- 3.2 Check hardware and infrastructure status for failures
- 3.3 Review the change log for any recent changes that may have caused the outage
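As a rough sketch of steps 3.1 and 3.3, the helpers below filter log lines for error markers and list changes applied shortly before the outage began. The marker strings, the 24-hour look-back window, and the `(applied_at, description)` change format are assumptions; real log and change-log formats will differ.

```python
from datetime import datetime, timedelta


def error_lines(log_lines, markers=("ERROR", "CRITICAL", "FATAL")):
    """Return the log lines that contain any of the assumed severity markers."""
    return [line for line in log_lines if any(m in line for m in markers)]


def changes_before_outage(changes, outage_start, window_hours=24):
    """Return changes applied within the look-back window, most recent first.

    `changes` is an iterable of (applied_at, description) tuples (hypothetical format).
    """
    earliest = outage_start - timedelta(hours=window_hours)
    suspects = [c for c in changes if earliest <= c[0] <= outage_start]
    return sorted(suspects, key=lambda c: c[0], reverse=True)


changes = [
    (datetime(2024, 1, 14, 22, 0), "Deployed app v2"),
    (datetime(2024, 1, 10, 8, 0), "Patched OS"),
    (datetime(2024, 1, 15, 8, 30), "Firewall rule update"),
]
for applied_at, description in changes_before_outage(changes, datetime(2024, 1, 15, 9, 0)):
    print(applied_at, description)
```

Listing the most recent change first reflects the common pattern that the latest change before an outage is the first one worth checking, though it is not proof of cause.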
Implement Resolution
Apply the fix to restore the system, whether it is a restart, failover, rollback, or hardware replacement.
- 4.1 Implement the identified fix to restore service
- 4.2 If the root cause cannot be immediately fixed, implement a workaround
- 4.3 If necessary, escalate to vendors or external support for assistance
Verify Service Restoration
Confirm that the system is fully operational and users can access services normally.
- 5.1 Verify the system is responsive and all services are running
- 5.2 Ask affected users to confirm access is restored
- 5.3 Monitor the system closely for any recurrence
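Steps 5.1 and 5.3 can be approximated by repeating a health check for a period after the fix. In this sketch the `check` callable is a placeholder for whatever probe suits the system (an HTTP request, a database query, a port test), and the attempt count and interval are assumed values.

```python
import time
from typing import Callable


def monitor_for_recurrence(check: Callable[[], bool],
                           attempts: int = 10,
                           interval_seconds: float = 30.0,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `check` repeatedly; return True only if every attempt passes.

    Stops at the first failed check so the incident can be reopened promptly.
    """
    for attempt in range(attempts):
        if not check():
            return False
        # Wait between checks, but not after the final one.
        if attempt < attempts - 1:
            sleep(interval_seconds)
    return True


# Example with a stand-in check that always passes and no real waiting.
print(monitor_for_recurrence(lambda: True, attempts=3, interval_seconds=0))
```

Passing `sleep` as a parameter is a design choice that lets the function be exercised without real delays.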
Communicate Resolution and Close
Send a final notification confirming service restoration and conduct a post-incident review for significant outages.
- 6.1 Send the service restored notification to all affected stakeholders
- 6.2 Document the outage details, root cause, and resolution in the incident record
- 6.3 Schedule a post-incident review for significant or recurring outages
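One possible shape for the incident record in step 6.2, with a helper for the review decision in step 6.3, is sketched below. The field names and the rule that "significant" means SEV1 are assumptions; use whatever your incident management tool and severity definitions specify.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class IncidentRecord:
    """Hypothetical incident record capturing outage details, cause, and fix."""
    system: str
    severity: str
    detected_at: datetime
    restored_at: Optional[datetime] = None
    root_cause: str = ""
    resolution: str = ""
    recurring: bool = False

    def duration_minutes(self) -> Optional[float]:
        """Minutes from detection to restoration, or None while still open."""
        if self.restored_at is None:
            return None
        return (self.restored_at - self.detected_at).total_seconds() / 60

    def needs_review(self, significant=("SEV1",)) -> bool:
        """Assumed rule: review significant (SEV1) or recurring outages."""
        return self.severity in significant or self.recurring


record = IncidentRecord(
    system="billing-db",
    severity="SEV2",
    detected_at=datetime(2024, 1, 15, 9, 0),
    restored_at=datetime(2024, 1, 15, 9, 45),
    root_cause="Disk full on primary node",
    resolution="Cleared old logs and extended volume",
)
print(asdict(record)["system"], record.duration_minutes(), record.needs_review())
```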
Quality Checkpoints
Common Mistakes to Avoid
Expected Outcomes
- Average duration of unplanned outages from detection to service restoration
- Average time from outage detection to first stakeholder notification
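Both averages can be computed from incident timestamps. The sketch below assumes each incident is a dictionary with `detected`, `notified`, and `restored` timestamps. These key names are hypothetical, and the sample data is made up for illustration.

```python
from datetime import datetime
from statistics import mean


def average_minutes(pairs):
    """Average gap in minutes across (start, end) timestamp pairs.

    Raises statistics.StatisticsError if `pairs` is empty.
    """
    return mean((end - start).total_seconds() / 60 for start, end in pairs)


def average_outage_minutes(incidents):
    """Average time from detection to service restoration."""
    return average_minutes((i["detected"], i["restored"]) for i in incidents)


def average_notification_minutes(incidents):
    """Average time from detection to first stakeholder notification."""
    return average_minutes((i["detected"], i["notified"]) for i in incidents)


# Illustrative sample data only.
incidents = [
    {"detected": datetime(2024, 1, 1, 9, 0),
     "notified": datetime(2024, 1, 1, 9, 10),
     "restored": datetime(2024, 1, 1, 10, 0)},
    {"detected": datetime(2024, 1, 2, 14, 0),
     "notified": datetime(2024, 1, 2, 14, 20),
     "restored": datetime(2024, 1, 2, 14, 30)},
]
print(average_outage_minutes(incidents))        # 45.0
print(average_notification_minutes(incidents))  # 15.0
```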
Frequently Asked Questions
How can we reduce the frequency of unplanned outages?
Regular maintenance, proactive monitoring, timely patching, capacity planning, and post-incident reviews all contribute to reducing the frequency and impact of unplanned outages.
What is the difference between planned and unplanned downtime?
Planned downtime is scheduled in advance for maintenance or upgrades and is communicated to users beforehand. Unplanned downtime is unexpected, results from failures, and is handled through this downtime response procedure.
Who should be notified of a system outage?
Affected users, IT management, and business stakeholders who rely on the system should all be notified. For major outages affecting many users, an organisation-wide notification may be appropriate.