Incident management & on-call schedules with Microsoft Sentinel
Are there benefits in using a separate incident management platform together with a SIEM? Let's investigate.
24/7 security incident management requires more resources, increased flexibility, and a higher level of preparedness than only working during office hours.
It also creates more demands for the technical flow of alert notifications and incident management.
A team working during the daytime can rely to some degree on humans spotting alerts as they come in, but the fewer people are actually at the console at a given time, the more alert notifications matter.
For an average in-house SecOps team, the main option for handling 24/7 detection and response is usually to start an on-call rotation. This creates the need for some kind of automated paging system that goes beyond e-mail: SMS at minimum, maybe automated robot calls too.
For a 24/7 SOC the situation is a bit different, as they have more “eyes on the ground” at all times. Even then a dedicated alert and incident management toolkit often exists for good reasons, though sometimes this need may be handled by the SOAR platform.
Especially in on-call scenarios, the lack of a proper toolkit for handling alerts creates a high chance of incidents going undetected or not being resolved quickly.
Incident handling in Microsoft Sentinel
Microsoft Sentinel itself provides quite a few incident and alert handling capabilities out of the box:
Incident Owner - The Azure AD identity (user or group) that is currently responsible for responding to a specific incident.
Incident Task - A checklist feature that can help standardise and formalise the list of activities required to respond to a specific incident.
Playbooks - Playbooks are the core SOAR component in Microsoft Sentinel, based on workflows built in Azure Logic Apps. They can do a lot: API calls, webhooks, notifications, enrichment, different kinds of response activities.
Automation Rules - Static incident task triggers that don't require interactivity: Run Playbook, Assign Owner, Change Status, Change Severity, Add task, Add tags.
When and how to use a dedicated tool?
Built-in capabilities in Sentinel can be used for a lot of the classic alert notification requirements when the workflow is mostly static. The classic example: running a Playbook via an Automation rule to send a Teams message.
Handling more complex notification workflows based on changing on-call rotation schedules and different notification methods is more difficult with the built-in features.
Alerting with a Logic App becomes difficult if you need to know the current status of the team and the current notification expectation. Who is available and are they on the console or not?
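To illustrate the "who is available right now?" question, here is a minimal sketch of the lookup a scheduling tool does for you. It assumes a simple fixed-length weekly rotation; the names, start date and the function `current_on_call` are all made up for the example, and real tools also handle overrides, swaps and time zones.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rotation data: names, start date and shift length are illustrative only.
ROTATION = ["alice@example.com", "bob@example.com", "carol@example.com"]
ROTATION_START = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)  # a Monday, 08:00 UTC
SHIFT_LENGTH = timedelta(days=7)


def current_on_call(now: datetime) -> str:
    """Return who is on call at `now`, assuming fixed-length shifts taken in order."""
    if now < ROTATION_START:
        raise ValueError("rotation has not started yet")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shifts_elapsed = (now - ROTATION_START) // SHIFT_LENGTH
    return ROTATION[shifts_elapsed % len(ROTATION)]
```

Even this toy version needs state (the schedule) that has to live somewhere and be kept up to date, which is awkward to maintain inside a Logic App.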
This is where dedicated management tools often step in. Some commercial examples of such tools are Opsgenie, Squadcast and PagerDuty. There are also some open-source alternatives with varying capabilities.
So what do you get with such a tool? Usually the core capabilities include at least the following:
Flexibility for different workflows in alerting, routing and escalation.
Consolidation of notification flows between the SIEM and other tools.
Alert suppression and deduplication capabilities.
APIs and webhooks for integrations.
SLO and other metric reporting for individuals and teams.
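As a rough idea of what suppression and deduplication mean in practice, the sketch below pages at most once per deduplication key within a time window. The class `Deduplicator` and the key format are assumptions made for this example and do not reflect how any particular product implements it.

```python
from datetime import datetime, timedelta, timezone


class Deduplicator:
    """Suppress repeat pages for the same key inside a time window (illustrative only)."""

    def __init__(self, window: timedelta):
        self.window = window
        self._last_paged = {}  # dedup key -> time of the last page that was sent

    def should_page(self, dedup_key: str, now: datetime) -> bool:
        last = self._last_paged.get(dedup_key)
        if last is not None and now - last < self.window:
            return False  # same alert paged recently, so stay quiet
        self._last_paged[dedup_key] = now
        return True


# Usage: the key could be, for example, the analytics rule name plus the affected host.
dedup = Deduplicator(window=timedelta(minutes=30))
t0 = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
first = dedup.should_page("rule-x/host-1", t0)
repeat = dedup.should_page("rule-x/host-1", t0 + timedelta(minutes=10))
```

Here `first` is True and `repeat` is False, so the on-call person is woken up once rather than for every repeat of the same alert.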
Below is a simplified example of an incident notification flow with Sentinel and an external tool.
With this workflow the on-call person gets paged based on the current notification settings, and the Sentinel incident owner gets changed to the identity of the person currently on call. Likely some comments and tags are also added to the incident.
The key thing in deploying this architecture is to make it truly bi-directional, so that both ends are always in agreement with the incident status, severity and ownership.
Details depend on the selected toolkit. In some cases you may need to use the incident management tool’s API for handling operations in both directions; in other cases the Sentinel status is updated by a webhook from the external tool.
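One way to think about keeping both ends in agreement is a small reconciliation step, sketched below. The Sentinel statuses (New, Active, Closed) are the real ones, but the external status names, the mapping and the "latest change wins" rule are assumptions made for the example; your tool may use different states and you may prefer a different conflict rule.

```python
from datetime import datetime, timezone

# Assumed mapping between Sentinel incident statuses and a hypothetical external tool.
SENTINEL_TO_EXTERNAL = {"New": "triggered", "Active": "acknowledged", "Closed": "resolved"}
EXTERNAL_TO_SENTINEL = {ext: sen for sen, ext in SENTINEL_TO_EXTERNAL.items()}


def reconcile(sentinel_status, sentinel_updated, external_status, external_updated):
    """Return (side_to_update, new_status), or None when both sides already agree.

    Conflict rule (an assumption): the side that changed most recently wins.
    """
    if SENTINEL_TO_EXTERNAL[sentinel_status] == external_status:
        return None
    if sentinel_updated >= external_updated:
        return ("external", SENTINEL_TO_EXTERNAL[sentinel_status])
    return ("sentinel", EXTERNAL_TO_SENTINEL[external_status])


# Usage: the analyst acknowledged the page after the incident was created in Sentinel.
created = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
acked = datetime(2024, 1, 1, 2, 5, tzinfo=timezone.utc)
action = reconcile("New", created, "acknowledged", acked)
```

In this usage `action` is `("sentinel", "Active")`, meaning the Sentinel incident should be moved to Active. The same pattern would need to be repeated for severity and ownership.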
When to alert the on-call analyst?
Alerting is always something that needs careful consideration. Just like during the daytime, the SecOps team should focus on actionable notifications to avoid alert fatigue.
We can borrow a couple of pointers from the SRE book:
Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued.
Every page should be actionable.
Every page response should require intelligence. If a page merely merits a robotic response, it shouldn’t be a page.
I think these are good guidelines for making notification decisions for threat detections in a SIEM. There is also a whole section on being on-call in the book.
Of course it’s not just about tuning the alerting thresholds. We should aim for high-quality and actionable outputs “at the source”, meaning in the detections themselves.
But especially for on-call situations it is often necessary to decouple detections from alerting a bit and make different decisions based on the team's availability and capability at different times.
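Decoupling detections from alerting could look something like the sketch below: the same incident is routed differently depending on severity and time of day. The severities are Sentinel's own, but the office hours, the channel names and the thresholds are assumptions for illustration, and `now` is assumed to be in the team's local time.

```python
from datetime import datetime, time

# Assumed office hours, Monday to Friday, in the team's local time.
OFFICE_START, OFFICE_END = time(8, 0), time(17, 0)


def notification_channel(severity: str, now: datetime) -> str:
    """Pick a notification channel for an incident (illustrative policy only)."""
    in_office = now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END
    if in_office:
        # Someone is at the console: a chat message is enough for Medium and High.
        return "chat" if severity in ("Medium", "High") else "queue"
    # Outside office hours only High severity is worth waking someone up for.
    return "page" if severity == "High" else "queue"
```

With this policy a High severity incident at 23:00 pages the on-call analyst, while a Medium one waits in the queue until the next working day. The detection itself stays the same, only the notification decision changes.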
Closing
This was a quick look at a complex topic. I will return to this with a few practical examples and Playbook demos later.
Do you have experience working with external alerting tools in the SIEM context, or are you interested in evaluating the possibility? Let me know and we can keep the discussion going.