Learn (and improve)
The end of an incident should be the beginning of learning. ilert's post-incident analysis and reporting tools enable your team to learn from every incident. Comprehensive timelines, response details gathered from chat channels, and resolution times facilitate a deep understanding of areas for improvement. Utilize templated post-mortem reports to share key findings and transform every challenge into an opportunity for growth.
Why conduct Post-Incident Reviews (Post-Mortems)
What are Post-mortems?
A postmortem, or post-incident review, is a blameless analysis conducted after an incident to gain a thorough understanding of what went wrong, why it occurred, and how to prevent its recurrence.
During an incident, the team focuses entirely on restoring service; postmortems provide a platform to evaluate actions and strategies after service has been restored.
They allow us to identify strengths, areas of improvement, and strategies to avoid repeated mistakes in the future.
Conducting a postmortem is not a penalty; it's a collaborative process that involves all responders. While the tech team may lead the analysis, the process's ownership lies with a designated individual, ensuring accountability and driving the postmortem to completion.
A postmortem should be conducted after every significant incident, even if the issue was quickly resolved without intervention. The ideal time for a postmortem is soon after the incident while the event's details are still fresh. It serves as the final step of the incident response process, and any delay can hinder critical learning.
By championing a culture of learning and improvement through postmortems, organizations can enhance their infrastructure and incident response process, ensuring they're better equipped for future incidents.
Postmortem Preparation Steps
1. Assign a Postmortem Owner and set up a meeting
After the resolution of a major incident, the Incident Response Lead promptly assigns one of the responders to oversee the postmortem process. Although the task of writing the postmortem is a collective effort, having a designated owner is crucial for its effective completion.
The postmortem owner is entrusted with several responsibilities, including:
To facilitate comprehensive analysis and ensure all perspectives are considered, the postmortem meeting should include the following participants:
The inclusion of these stakeholders encourages a holistic examination of the incident, fostering the cultivation of more robust preventive measures.
2. What happened? Incident Timeline and Impact
After preparing for the postmortem, the next step is to construct a comprehensive timeline of the incident and document its impact.
3. Building the Timeline
Focus on documenting the sequence of events, avoiding any interpretation or judgment about the incident's causes. The timeline should start before the incident's onset and continue through to its resolution, and include significant changes in status or impact, as well as key actions taken by responders.
Review the incident log in your communication tool (e.g. Slack or Microsoft Teams) for crucial decisions and actions. Also include what the team didn't know during the incident that, in hindsight, would have been helpful. You can find this information in the monitoring data, logs, and deployment records of the affected services.
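As a rough illustration of the timeline step, the sketch below merges events from several sources into one chronological list. It is only a sketch under stated assumptions: the `TimelineEvent` type, the source names, and the sample events are hypothetical and do not reflect any ilert API or export format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime  # when the event happened (UTC)
    source: str          # hypothetical label: "chat", "monitoring", "deploy"
    description: str     # factual statement, no interpretation of causes


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from all sources and sort them chronologically."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.timestamp)


def format_timeline(events: list[TimelineEvent]) -> str:
    """Render one line per event for pasting into the postmortem report."""
    return "\n".join(
        f"{e.timestamp:%Y-%m-%d %H:%M} UTC [{e.source}] {e.description}"
        for e in events
    )


def utc(hour: int, minute: int) -> datetime:
    """Helper for the made-up sample data below."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


# Made-up sample events; the timeline starts before the incident's onset.
chat = [
    TimelineEvent(utc(10, 12), "chat", "Responder acknowledges alert"),
    TimelineEvent(utc(10, 40), "chat", "Team decides to roll back"),
]
deploys = [TimelineEvent(utc(9, 55), "deploy", "Release deployed to production")]
monitoring = [TimelineEvent(utc(10, 5), "monitoring", "Error rate alert fires")]

timeline = build_timeline(chat, deploys, monitoring)
print(format_timeline(timeline))
```

Keeping the source on every line makes it easy to see where each fact came from when the timeline is reviewed in the meeting.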
4. Documenting the Impact
Record the impact from multiple perspectives. Detail the duration of the visible impact, the number of customers affected, the number of customers that reported the incident, and the severity of the functional impact.
Quantify impact using a business metric specific to your product, for instance the increase in API errors, degraded response times, or delayed notification delivery. If necessary, provide a list of all impacted customers to your support team for further action.
Remember, the goal here is to create an objective, factual record of the incident and its impact. Avoid jumping to conclusions or assigning blame; these steps are purely observational and informational.
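The impact figures named above can be captured in a small record so that the duration and the share of affected customers who reported the incident are computed rather than estimated. This is a minimal sketch: the `ImpactRecord` type, its field names, and the sample numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ImpactRecord:
    impact_start: datetime   # first moment customers could see the impact
    impact_end: datetime     # moment the visible impact ended
    customers_affected: int
    customers_reported: int  # customers who contacted support about it
    severity: str            # free-text severity of the functional impact

    @property
    def duration(self) -> timedelta:
        return self.impact_end - self.impact_start

    @property
    def reported_share(self) -> float:
        """Fraction of affected customers who reported the incident."""
        if self.customers_affected == 0:
            return 0.0
        return self.customers_reported / self.customers_affected

    def summary(self) -> str:
        minutes = int(self.duration.total_seconds() // 60)
        return (
            f"Visible impact: {minutes} min; "
            f"{self.customers_affected} customers affected, "
            f"{self.customers_reported} reported ({self.reported_share:.0%}); "
            f"severity: {self.severity}"
        )


# Made-up sample values.
impact = ImpactRecord(
    impact_start=datetime(2024, 1, 15, 10, 5, tzinfo=timezone.utc),
    impact_end=datetime(2024, 1, 15, 10, 52, tzinfo=timezone.utc),
    customers_affected=120,
    customers_reported=6,
    severity="API requests failing intermittently",
)
print(impact.summary())
```

A product-specific business metric, such as failed API requests during the impact window, could be added as a further field in the same way.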
5. Root Cause Analysis
Once you have a thorough understanding of the incident's timeline and impact, you'll move on to the Root Cause Analysis (RCA). This stage explores the contributing factors that led to the incident, bearing in mind that complex systems don't typically fail because of a single root cause but because of a combination of interacting factors.
Monitoring Review
Identifying Underlying Causes:
Evaluation of Process:
This stage is also an opportunity to evaluate and improve the incident response process itself.
Summary of Findings:
Pre-work and documentation are essential to ensure a productive discussion during the postmortem meeting, although additional insights may emerge during the conversation.
Remember, the ultimate goal of the RCA is to uncover the multiple interacting elements that led to the failure and to inform preventative measures for the future.
6. Create Action Items
After determining the causes of the incident, you need to decide what steps should be taken to prevent similar issues from recurring. Although it may not always be feasible or worthwhile to entirely eliminate the possibility of such incidents, it's essential to consider improving detection and mitigation measures for future events. This includes better monitoring and alerting systems and strategies to reduce the severity or duration of incidents.
Create tickets for all proposed actions in your task management tool. Make sure to provide sufficient context and proposed direction for each ticket, so the product owner can prioritize the task and the assignee can carry it out efficiently. Each action item should be actionable and specific.
If any proposed actions require further discussion before creating tickets, add these items to the postmortem meeting agenda. These could be proposals needing team validation or clarification. Discussing these in the meeting will help decide the best course of action.
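One way to apply the guidance above before opening tickets is to check each proposed action for context and direction, and to separate the items that still need discussion. The sketch below is an assumption-laden illustration: `ActionItem`, its fields, and the `needs_discussion` flag are hypothetical and not tied to any task management tool.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str                 # specific, actionable summary
    context: str               # why the action is needed (link to the incident)
    proposed_direction: str    # suggested approach for the assignee
    needs_discussion: bool = False  # hypothetical flag set by the author


def missing_fields(item: ActionItem) -> list[str]:
    """Return the names of required fields that are empty or whitespace."""
    required = {
        "title": item.title,
        "context": item.context,
        "proposed_direction": item.proposed_direction,
    }
    return [name for name, value in required.items() if not value.strip()]


def split_for_meeting(
    items: list[ActionItem],
) -> tuple[list[ActionItem], list[ActionItem]]:
    """Split items into (ready for tickets, for the meeting agenda)."""
    ready = [i for i in items if not i.needs_discussion]
    agenda = [i for i in items if i.needs_discussion]
    return ready, agenda


# Made-up sample items.
items = [
    ActionItem(
        title="Add alert on queue depth",
        context="Backlog grew unnoticed for 20 minutes during the incident",
        proposed_direction="Alert when depth exceeds the normal daily peak",
    ),
    ActionItem(
        title="Change rollback policy",
        context="Rollback decision was delayed",
        proposed_direction="",
        needs_discussion=True,
    ),
]

ready, agenda = split_for_meeting(items)
for item in items:
    gaps = missing_fields(item)
    if gaps:
        print(f"'{item.title}' is missing: {', '.join(gaps)}")
```

Items flagged as incomplete can be filled in before the ticket is created, so the product owner has enough information to prioritize them.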