Prepare (for anything)
(c) 2011 - 2024 ilert GmbH
Preparation is the cornerstone of effective incident response. The better prepared you are, the better your response will be when incidents occur. Preparation means setting up systems and structures that facilitate efficient detection, notification, and resolution of incidents. Here's what you need to do:
The first step to effective incident management is setting up tools to monitor your systems and applications. These tools provide real-time visibility into your IT environment, allowing you to detect anomalies, performance issues, and potential incidents as soon as they occur. Setting up proper monitoring is a vast topic and depends heavily on the nature of your infrastructure. Although it's crucial, we won't cover it in detail in this guide due to its scope and variability across different systems.
Establishing an on-call team and an appropriate rotation is a crucial step in the preparation for incident response. Having a dedicated set of people who are trained and ready to respond to incidents can drastically reduce your response time and prevent escalations.
On-call rotations help prevent burnout by sharing the load among team members, ensuring no one individual is always on duty.
Setting a rotation schedule that suits your team's specific needs can be challenging but is critical to maintain a healthy work-life balance while ensuring coverage.
It's worth noting that how you structure your on-call team and rotations may vary depending on your organization's size, needs, and resources. While we're touching on the basics here, we will dive into the various models of organizing on-call teams in detail in the On-call organization models chapter later in this guide.
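As a minimal sketch of a rotation, the snippet below assigns shifts round-robin in fixed-length blocks. The function name `on_call_for`, the team members, and the one-week shift length are illustrative assumptions, not part of any ilert API.

```python
from datetime import date

def on_call_for(day: date, members: list[str], rotation_start: date,
                shift_days: int = 7) -> str:
    """Return who is on call on `day` in a simple round-robin rotation."""
    if not members:
        raise ValueError("rotation needs at least one member")
    if day < rotation_start:
        raise ValueError("day is before the rotation starts")
    # Count how many full shifts have passed, then wrap around the team list.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return members[shifts_elapsed % len(members)]

team = ["alice", "bob", "carol"]  # hypothetical team members
start = date(2024, 1, 1)          # hypothetical rotation start (a Monday)
```

With three members and weekly shifts, each person is on call one week out of three, and the same function can answer "who is on call on a given date?" for planning handovers.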
Give your on-call team the ability to manage their own schedules and hand over shifts quickly when required. This level of autonomy fosters efficiency and adaptability, ensuring quick adjustments to any changes in circumstances.
Below is an example from the ilert mobile app where you can see your current on-call status and quickly take over someone else's on-call shift.
For critical services, always designate both primary and secondary on-call personnel. The secondary person can step in if the primary responder is unable to address the incident, ensuring that there's always someone available to handle emergencies. Set a proper escalation timeout depending on the criticality of the service.
For critical services, we recommend 5 minutes. Also consider a third level of escalation, e.g. your entire team. Below is an example escalation policy with three levels and automatic escalation after 5 minutes.
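The three-level policy described above can be modeled roughly as follows. This is a sketch under assumptions: the `EscalationLevel` type, the responder names, and the `active_level` helper are hypothetical and only illustrate the idea of escalating after a 5-minute timeout per level.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationLevel:
    responders: tuple[str, ...]
    timeout_minutes: int  # how long to wait for an acknowledgement before escalating

# Hypothetical policy: primary, then secondary, then the entire team.
POLICY = (
    EscalationLevel(("primary",), 5),
    EscalationLevel(("secondary",), 5),
    EscalationLevel(("alice", "bob", "carol"), 5),
)

def active_level(minutes_unacknowledged: float, policy=POLICY) -> int:
    """Return the index of the level that should be notified right now."""
    elapsed = 0
    for index, level in enumerate(policy):
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return index
    # All timeouts have expired: stay on the last level.
    return len(policy) - 1
```

An alert that has gone unacknowledged for 7 minutes would sit at level index 1 (the secondary responder); after 10 minutes the whole team is notified.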
If your team is globally distributed, consider a follow-the-sun model. This approach allows on-call responsibilities to be passed between time zones, ensuring that your team members handle incidents during their daytime hours, reducing stress and fatigue.
However, the success of a follow-the-sun schedule relies not only on the distribution of team members across various time zones but also on each member's proficiency in handling potential incidents. Each participant needs to have adequate knowledge and technical capabilities to act as an effective responder for the service in question.
When teams are distributed across time zones and every team member is proficient in maintaining and troubleshooting the system, the follow-the-sun model works particularly well.
It ensures that on-call responsibilities are shared more equitably and that incidents are addressed more promptly, ultimately contributing to a better, more reliable service for your users.
The screenshot below shows an example of a follow-the-sun schedule with a team in the US and a team in the EU.
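A follow-the-sun handover between two teams can be sketched as a lookup on the current UTC hour. The handover hours below (EU from 06:00 to 18:00 UTC, US for the rest of the day) are purely illustrative assumptions; pick boundaries that match your teams' actual working hours.

```python
from datetime import datetime, timezone

# Hypothetical handover hours in UTC.
EU_START_HOUR_UTC = 6
US_START_HOUR_UTC = 18

def team_on_call(now: datetime) -> str:
    """Return which team is on call at `now` (must be timezone-aware)."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # Normalize to UTC so the result does not depend on the caller's time zone.
    hour = now.astimezone(timezone.utc).hour
    return "EU" if EU_START_HOUR_UTC <= hour < US_START_HOUR_UTC else "US"
```

Requiring a timezone-aware timestamp avoids the classic mistake of comparing a local wall-clock hour against UTC handover boundaries.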
Connect your monitoring and observability tools with your alerting and on-call management tool. This integration ensures that when an anomaly is detected, an alert is generated, and the appropriate on-call team member is notified immediately. Below are a few things to consider when setting up alerting:
Keep your primary system infrastructure separate from your alerting system
Don't let an issue in your primary system infrastructure prevent you from getting the alert. Keeping your primary system infrastructure and alerting system separate ensures that you'll still receive alerts even if your primary system encounters problems.
But separation is only the first step. It's also essential to establish mechanisms that confirm the continuous, seamless communication between your monitoring and alerting systems. One reliable way to do this is by implementing heartbeat monitoring.
In a heartbeat monitoring setup, your monitoring system sends regular "pings" to the alerting system. If the alerting system doesn't receive these pings at the expected intervals, it automatically triggers an alert. This precaution ensures you're immediately notified if there's a disruption between your monitoring and alerting systems, preventing a silent failure from escalating into an unnoticed incident.
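On the receiving side, the heartbeat check boils down to comparing the time of the last ping with the expected interval. The sketch below assumes a hypothetical `heartbeat_missed` helper and an optional grace period; it is not how any particular alerting product implements it.

```python
from datetime import datetime, timedelta, timezone

def heartbeat_missed(last_ping: datetime, now: datetime,
                     interval: timedelta,
                     grace: timedelta = timedelta(0)) -> bool:
    """Return True if no ping arrived within the expected interval plus grace."""
    return now - last_ping > interval + grace

# Example: pings are expected every 5 minutes, with 1 minute of grace.
last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = heartbeat_missed(last, last + timedelta(minutes=4),
                           timedelta(minutes=5), timedelta(minutes=1))
overdue = heartbeat_missed(last, last + timedelta(minutes=7),
                           timedelta(minutes=5), timedelta(minutes=1))
```

Here `on_time` is `False` and `overdue` is `True`, so only the second case would trigger an alert. The grace period keeps small network delays from producing false alarms.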
Remember, a robust alerting system is only as good as its ability to receive and respond to problems in your primary system. Ensuring separate infrastructures and continuous communication is key to maintaining this vital lifeline.
Make your incident response resilient to internet outages by setting up at least two different alerting channels. Begin with push notifications as your primary method; given our near-constant access to smartphones, it's an immediate and usually sufficient alerting medium.
Make sure that critical alerts cut through the noise and are not silenced by 'Do Not Disturb' (DND) modes. The ilert mobile app, for instance, supports critical push notifications. These notifications are specially designed to bypass DND settings, ensuring that you're alerted no matter what.
If push notifications fail, fall back to more assertive methods like phone calls or SMS notifications. Add all caller IDs from your alerting system to your phone's contact book. Configure these contacts in your phone's settings to bypass DND, ensuring that these critical alerts don't go unheard. The ilert mobile app conveniently syncs and updates these contacts for you, keeping your alerting system well-integrated with your phone.
In this process, it's also vital to incorporate bi-directional alerting channels. This means acknowledging an alert should be as seamless as receiving it, right on the same platform. For example, if you receive a phone call alert, acknowledging it could be as simple as pressing a digit. Once an alert is acknowledged, the system should ensure that it doesn't escalate to your other devices or to other people, preventing redundant notifications.
Notifications should be sent as soon as the alert is created and repeated every minute until the escalation timeout is reached. If there is still no response after three attempts, the incident should be escalated to the next level, on the assumption that you are unable to respond.
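The repeat-then-escalate rule can be sketched as a small decision function. The name `next_action` and its return values are hypothetical; the defaults mirror the figures in the text (one-minute repeats, three attempts), with escalation at the minute the fourth attempt would have been due, which is an assumption about timing.

```python
def next_action(minutes_since_alert: int, acknowledged: bool,
                repeat_every: int = 1, max_attempts: int = 3) -> str:
    """Decide what to do with an alert at a given whole minute.

    Returns "stop" once acknowledged, "notify" when an attempt is due,
    "wait" between attempts, and "escalate" after the last attempt
    went unanswered.
    """
    if acknowledged:
        # An acknowledgement halts both repeats and escalation.
        return "stop"
    if minutes_since_alert >= repeat_every * max_attempts:
        return "escalate"
    if minutes_since_alert % repeat_every == 0:
        return "notify"
    return "wait"
```

With the defaults, the responder is notified at minutes 0, 1, and 2, and the alert escalates at minute 3 unless it has been acknowledged in the meantime.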
This multi-channel approach, paired with the right tools, ensures that no critical alert goes unnoticed and the response process remains uninterrupted, regardless of external factors.
Set up a way to report incidents manually
Establish a dedicated hotline for manual incident reporting. This hotline should be capable of forwarding calls to the on-call team according to the on-call rotation.
This not only allows for immediate incident reporting but also ensures that incidents get routed to the right people swiftly. Alternatively, you could enable users to report incidents directly from their daily chat tool.
Using a single system to route both alerts and incoming phone calls to your engineers simplifies the process, reduces confusion, and streamlines communication.
Remember, preparation isn't a one-time event; it's a continuous process. As your systems and teams evolve, your preparation must adapt accordingly.
Regularly review and update your incident response plans and tools to ensure they remain effective and aligned with your current needs and capabilities.