
Updated March 2026

System Downtime Response

A procedure for responding to unplanned system outages so that service is restored as quickly as possible and stakeholders are kept informed.

Purpose

To minimise the business impact of unplanned system downtime through rapid detection, structured response, and clear communication, ensuring systems are restored within defined service levels.

Scope

Covers all unplanned outages of business-critical systems including servers, applications, databases, network services, and cloud platforms.

Prerequisites

System monitoring and alerting configured for all critical systems
Defined severity levels and response time targets for different systems
Escalation paths and on-call rosters for infrastructure and application teams
Stakeholder communication templates for outage notifications

Step-by-Step Procedure

Detect and Confirm the Outage

Identify the outage through monitoring alerts or user reports and confirm the scope and severity of the downtime.

1.1 Receive the alert from the monitoring system or user report
1.2 Confirm the outage by checking the affected system directly
1.3 Determine the severity level based on the number of users and business processes affected

Responsible: Systems Operations Analyst

Duration: 5 minutes

Tools: Monitoring Dashboard, IT Service Desk System
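
The severity decision in step 1.3 can be sketched as a small rule table. This is a minimal illustration only: the level names (SEV1 to SEV3), the thresholds, and the names `OutageReport` and `classify_severity` are assumptions, not values defined by this procedure. Substitute your organisation's own severity definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutageReport:
    users_affected: int
    critical_processes_affected: int  # business processes that cannot run


def classify_severity(report: OutageReport) -> str:
    """Return a hypothetical severity label for a confirmed outage."""
    # Assumed thresholds; replace with your defined severity levels.
    if report.critical_processes_affected >= 1 and report.users_affected >= 100:
        return "SEV1"
    if report.critical_processes_affected >= 1 or report.users_affected >= 25:
        return "SEV2"
    return "SEV3"


if __name__ == "__main__":
    report = OutageReport(users_affected=250, critical_processes_affected=2)
    print(classify_severity(report))  # prints "SEV1"
```

Keeping the rules in one function makes it easier to classify consistently within the five-minute window, whoever is on call.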

Notify Stakeholders

Send an initial notification to affected users and management advising of the outage and expected investigation timeline.

2.1 Send an initial outage notification using the communication template
2.2 Notify the IT management team and business stakeholders
2.3 Set up a status update schedule for ongoing communication

Responsible: IT Service Desk Analyst

Duration: 10 minutes

Tools: Communication Platform, Notification Templates, Email
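
Steps 2.1 and 2.3 could be supported by a short script that fills in the notification template and works out the update times. The template wording, the 30-minute interval used in the example, and the function names are hypothetical; use your approved communication template and agreed update cadence.

```python
from datetime import datetime, timedelta
from string import Template

# Hypothetical template text; replace with your approved wording.
INITIAL_NOTICE = Template(
    "[$severity] $system is currently unavailable. "
    "Investigation started at $started. Next update by $next_update."
)


def build_initial_notice(system: str, severity: str, started: datetime,
                         update_interval: timedelta) -> str:
    """Fill in the initial outage notification."""
    next_update = started + update_interval
    return INITIAL_NOTICE.substitute(
        severity=severity,
        system=system,
        started=started.strftime("%H:%M"),
        next_update=next_update.strftime("%H:%M"),
    )


def update_schedule(started: datetime, update_interval: timedelta,
                    count: int) -> list[datetime]:
    """Return the times of the next `count` status updates."""
    return [started + update_interval * i for i in range(1, count + 1)]


if __name__ == "__main__":
    start = datetime(2026, 3, 1, 9, 0)
    print(build_initial_notice("Payroll", "SEV2", start, timedelta(minutes=30)))
    for due in update_schedule(start, timedelta(minutes=30), 3):
        print("Status update due:", due.strftime("%H:%M"))
```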

Diagnose the Root Cause

Investigate the cause of the outage by examining system logs, infrastructure status, and recent changes.

3.1 Review system and application logs for error messages
3.2 Check hardware and infrastructure status for failures
3.3 Review the change log for any recent changes that may have caused the outage

Responsible: Systems Operations Analyst

Duration: 15 to 60 minutes

Tools: Log Analysis Tools, Monitoring Dashboard, Change Management System
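
The change log review in step 3.3 amounts to filtering changes by system and time. The sketch below assumes change records have been exported from the change management system into simple objects; the `ChangeRecord` fields, the 24-hour lookback window, and the change IDs are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    change_id: str
    system: str
    applied_at: datetime


def suspect_changes(changes, system, outage_start,
                    lookback=timedelta(hours=24)):
    """Return changes to `system` applied within `lookback` before the
    outage started, newest first."""
    window_start = outage_start - lookback
    hits = [
        c for c in changes
        if c.system == system and window_start <= c.applied_at <= outage_start
    ]
    return sorted(hits, key=lambda c: c.applied_at, reverse=True)


if __name__ == "__main__":
    log = [
        ChangeRecord("CHG-1", "crm", datetime(2026, 3, 1, 8, 0)),
        ChangeRecord("CHG-2", "crm", datetime(2026, 2, 27, 8, 0)),
        ChangeRecord("CHG-3", "erp", datetime(2026, 3, 1, 8, 30)),
    ]
    for change in suspect_changes(log, "crm", datetime(2026, 3, 1, 9, 0)):
        print(change.change_id, change.applied_at)
```

Listing the newest change first reflects a common heuristic that the most recent change is the likeliest candidate, though it is not proof of cause.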

Implement Resolution

Apply the fix to restore the system, whether it is a restart, failover, rollback, or hardware replacement.

4.1 Implement the identified fix to restore service
4.2 If the root cause cannot be immediately fixed, implement a workaround
4.3 If necessary, escalate to vendors or external support for assistance

Responsible: Infrastructure Engineer

Duration: 15 minutes to 4 hours

Tools: Server Management Tools, Vendor Support Portal
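
The order of preference in steps 4.1 to 4.3 can be written down as a simple decision function. This is only one possible reading of the steps: the `Action` names and the three boolean inputs are assumptions, and in practice escalation may run in parallel with a workaround rather than instead of it.

```python
from enum import Enum


class Action(Enum):
    APPLY_FIX = "apply fix"
    APPLY_WORKAROUND = "apply workaround"
    ESCALATE = "escalate to vendor or external support"


def next_action(root_cause_known: bool, fix_available: bool,
                workaround_available: bool) -> Action:
    """Pick the next resolution action, preferring a fix over a workaround."""
    if root_cause_known and fix_available:
        return Action.APPLY_FIX          # step 4.1
    if workaround_available:
        return Action.APPLY_WORKAROUND   # step 4.2
    return Action.ESCALATE               # step 4.3


if __name__ == "__main__":
    print(next_action(root_cause_known=True, fix_available=False,
                      workaround_available=True).value)
```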

Verify Service Restoration

Confirm that the system is fully operational and users can access services normally.

5.1 Verify the system is responsive and all services are running
5.2 Ask affected users to confirm access is restored
5.3 Monitor the system closely for any recurrence

Responsible: Systems Operations Analyst

Duration: 15 minutes

Tools: Monitoring Dashboard, IT Service Desk System
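
Steps 5.1 and 5.3 can be approximated by polling a health check several times and treating the service as restored only if every poll passes. In this sketch, `check` stands in for whatever health probe your monitoring provides, and the default attempt count and interval are assumed values.

```python
import time
from typing import Callable


def verify_restoration(check: Callable[[], bool], attempts: int = 5,
                       interval_seconds: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True only if `check` passes on every one of `attempts`
    consecutive polls; stop at the first failure."""
    for attempt in range(attempts):
        if not check():
            return False
        # Wait between polls, but not after the final one.
        if attempt < attempts - 1:
            sleep(interval_seconds)
    return True


if __name__ == "__main__":
    # Stand-in probe that always succeeds; replace with a real health check.
    print(verify_restoration(lambda: True, attempts=3, interval_seconds=0))
```

Requiring consecutive passes guards against declaring success on a service that recovers briefly and then fails again. It does not replace confirmation from affected users in step 5.2.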

Communicate Resolution and Close

Send a final notification confirming service restoration, record the incident details, and schedule a post-incident review for significant or recurring outages.

6.1 Send the service restored notification to all affected stakeholders
6.2 Document the outage details, root cause, and resolution in the incident record
6.3 Schedule a post-incident review for significant or recurring outages

Responsible: IT Service Desk Analyst

Duration: 15 minutes

Tools: Communication Platform, IT Service Desk System
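
The incident record from step 6.2 and the review decision from step 6.3 might be modelled as follows. The field names are hypothetical, and so are the definitions used here: "significant" is taken to mean SEV1, and "recurring" to mean at least one earlier outage of the same system in the past 90 days. Neither definition comes from this procedure.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    system: str
    severity: str            # e.g. "SEV1" (assumed level names)
    root_cause: str
    resolution: str
    prior_outages_90d: int = 0  # hypothetical recurrence counter

    def needs_post_incident_review(self) -> bool:
        # Assumed rule: review SEV1 incidents and any recurring outage.
        return self.severity == "SEV1" or self.prior_outages_90d >= 1


if __name__ == "__main__":
    record = IncidentRecord(
        system="Payroll",
        severity="SEV2",
        root_cause="Disk full on database host",
        resolution="Cleared old logs and extended the volume",
        prior_outages_90d=1,
    )
    print(record.needs_post_incident_review())  # prints "True"
```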

Quality Checkpoints

Outage severity is classified within five minutes of detection

Stakeholders are notified within fifteen minutes of confirmed outage

Status updates are sent according to the schedule set in step 2.3 for the duration of the outage
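
The two timed checkpoints can be checked from incident timestamps. The sketch below uses the five-minute and fifteen-minute targets stated above; the function name, argument names, and result keys are assumptions.

```python
from datetime import datetime, timedelta

CLASSIFY_TARGET = timedelta(minutes=5)   # from detection to classification
NOTIFY_TARGET = timedelta(minutes=15)    # from confirmation to notification


def checkpoint_results(detected_at: datetime, classified_at: datetime,
                       confirmed_at: datetime,
                       notified_at: datetime) -> dict[str, bool]:
    """Report whether each timed quality checkpoint was met."""
    return {
        "classified_within_5_min": classified_at - detected_at <= CLASSIFY_TARGET,
        "notified_within_15_min": notified_at - confirmed_at <= NOTIFY_TARGET,
    }


if __name__ == "__main__":
    print(checkpoint_results(
        detected_at=datetime(2026, 3, 1, 9, 0),
        classified_at=datetime(2026, 3, 1, 9, 4),
        confirmed_at=datetime(2026, 3, 1, 9, 2),
        notified_at=datetime(2026, 3, 1, 9, 20),
    ))
```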

Common Mistakes to Avoid

Not communicating promptly with affected users, leading to confusion and frustration

Failing to check the change log for recent changes that may have caused the outage

Not conducting a post-incident review after significant outages, missing prevention opportunities

Closing the incident before confirming service is fully restored for all affected users

Expected Outcomes

Mean Time to Recover

Average duration of unplanned outages from detection to service restoration.

Stakeholder Notification Time

Average time from outage detection to first stakeholder notification.
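
Both metrics are averages of elapsed time, so one helper can compute either from pairs of timestamps. This is a minimal sketch; `mean_duration` and the sample timestamps are illustrative, and in practice the pairs would come from your incident records.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_duration(pairs: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (end - start) over a list of (start, end) pairs."""
    if not pairs:
        raise ValueError("at least one incident is required")
    seconds = mean([(end - start).total_seconds() for start, end in pairs])
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    # (detected, restored) pairs give Mean Time to Recover;
    # (detected, first notification) pairs give Stakeholder Notification Time.
    outages = [
        (datetime(2026, 3, 1, 9, 0), datetime(2026, 3, 1, 9, 30)),
        (datetime(2026, 3, 8, 10, 0), datetime(2026, 3, 8, 11, 30)),
    ]
    print(mean_duration(outages))  # prints "1:00:00"
```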

Frequently Asked Questions

How can we reduce the frequency of unplanned outages?

Regular maintenance, proactive monitoring, timely patching, capacity planning, and post-incident reviews all contribute to reducing the frequency and impact of unplanned outages.

What is the difference between planned and unplanned downtime?

Planned downtime is scheduled in advance for maintenance or upgrades and is communicated to users beforehand. Unplanned downtime is unexpected, results from failures, and triggers this downtime response procedure.

Who should be notified of a system outage?

Affected users, IT management, and business stakeholders who rely on the system should all be notified. For major outages affecting many users, an organisation-wide notification may be appropriate.
