In part two of this series, we discussed how to implement the identification and containment steps in the DDoS mitigation management framework. In part three, we will look at the steps you should take once you have begun attack mitigation. In particular, we will discuss the importance of performing a proper postmortem in order to identify opportunities for continuous improvement.
In a perfect world, our containment steps will remove 100% of the DDoS attack’s impact. In reality, this is rarely the case.
You need to go through your list of services and identify how much downtime the business can accept for each one. For services that cannot tolerate downtime, you need to identify backup processes. Your Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP) should have contingencies for dealing with a long-term DDoS attack. Again, we’ve seen DDoS attacks last as long as 12 days in the wild, so the business needs to be able to continue operating over extended periods of time.
In many ways, this is the most difficult step to get right, as there are so many variables. We tend to forget just how reliant we are on Internet connectivity for our day-to-day business operations. For example, what about remote employees? Will they be able to work if the VPN is offline? Are meetings scheduled through a third-party SaaS service, which would mean employees may not be able to book meeting rooms or receive notifications of upcoming events?
Once the DDoS attack subsides, we can breathe a sigh of relief. Or can we? A common attack technique is to pivot between multiple attack vectors. For example, an attack may start as an ICMP flood. Once you take steps to mitigate it, the attack transforms into a UDP reflection attack. Once you get that under control, an Internet of Things (IoT) bot army starts exhausting TCP connections. So we need to be sure we have actually entered recovery mode before we start to lean back.
Attackers like to leverage a bit of social engineering when they launch an attack. For example, they may launch the first wave at 5:15 PM local time on a Friday, in the hope that many of the people responsible for mitigating attacks will be in their cars heading home for the weekend. Once that attack subsides, they wait about an hour, which is just enough time for someone to give the all-clear signal, before launching a second wave, again hoping to catch the team off guard. They then wait until 11:00 PM local time or later to attack again, in the hope of catching people sleeping. This pattern is likely to increase the time it takes to mitigate each attack vector. It is also designed to wear responders down mentally so they are more likely to make a mistake.
When you think you have entered a recovery phase, analyze the mitigation steps that have been taken. Some mitigation techniques can be left in place with very little impact (like filtering a few source IP addresses). Some may degrade performance during normal periods of operation (like rate limiting). Still others carry a financial cost if they are left in place when they are not really needed (like using a third-party scrubbing service). Identify the long-term impact of each mitigation technique and determine a schedule for removing them. As they are removed, be sure to closely monitor your metrics for any unexpected activity.
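One way to make that review concrete is to record each active mitigation alongside the kind of ongoing cost it carries, then derive a removal order from that. The following is a minimal sketch, not a prescribed method: the `Mitigation` class, the cost categories, and the choice to remove financially costly mitigations first are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical cost categories. Lower number = remove sooner.
# The ordering itself is an assumption; adjust it to your own priorities.
REMOVAL_PRIORITY = {"financial": 0, "performance": 1, "minimal": 2}


@dataclass
class Mitigation:
    name: str
    ongoing_cost: str  # one of the keys in REMOVAL_PRIORITY


def removal_order(mitigations):
    """Sort mitigations so those that cost the most to keep are removed first."""
    return sorted(mitigations, key=lambda m: REMOVAL_PRIORITY[m.ongoing_cost])


# The three examples mentioned in the text above.
active = [
    Mitigation("source IP filter", "minimal"),
    Mitigation("rate limiting", "performance"),
    Mitigation("third-party scrubbing service", "financial"),
]

for step, m in enumerate(removal_order(active), start=1):
    print(f"{step}. remove '{m.name}' ({m.ongoing_cost} cost), then watch metrics")
```

Removing one mitigation at a time, with a monitoring pause between steps, makes it easier to tell which removal exposed you if the attack resumes.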
If “Preparation” is the most important DDoS mitigation management step, “Lessons Learned” is a close second. Attackers are constantly evaluating the success of their attacks in order to make future attacks more effective. You need to be doing the same with your protection techniques. Every DDoS event should be followed up by a postmortem. This should be attended by the people involved with mitigating the attack, as well as those authorized to make changes and improvements.
The postmortem should be run in a similar fashion to a Scrum retrospective. You want to have an open and honest discussion about what went right and what needs improvement. The focus should be on process improvement, not on placing blame on any one team or team member. When a human makes a mistake, you can usually point to a process that needs more refinement or improvement. The trick is ensuring that mistakes only happen once.
As mentioned earlier, DDoS attacks tend to come in waves. You may be asking, why not simply hit you with everything they have right off the bat? This is because attackers leverage mitigation time to extend the impact of their attack. If they hit you with everything all at once, you would simply deploy multiple techniques simultaneously to mitigate the attack. By spreading out their attacks, they can increase the attack’s impact by leveraging the amount of time it takes you to detect the attack, identify the attack’s unique characteristics, and implement a mitigation solution.
Think of it this way. Let’s say it takes you 30 minutes to go from initial detection to full mitigation. If I hit you with three attacks at once, the magnitude will be high but the impact will only be felt for 30 minutes. If I launch each attack 30 minutes apart, I’ve now disrupted your network for a full hour and a half. This is more than enough time for your customers to notice, post about the outage on Twitter, and have it confirmed and picked up by bloggers and analysts.
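The arithmetic behind this can be sketched in a few lines. The function below is a simplified model, not a measurement tool: it assumes every wave disrupts the network from its start until a fixed mitigation time has passed, and that overlapping waves are mitigated in parallel. The function name and parameters are made up for this illustration.

```python
def total_disruption_minutes(wave_starts, mitigation_minutes):
    """Total minutes during which at least one unmitigated wave is active.

    Each wave is assumed to disrupt the network from its start time until
    mitigation_minutes later. Overlapping windows are only counted once.
    """
    total = 0
    covered_until = None  # end of the disruption window counted so far
    for start in sorted(wave_starts):
        end = start + mitigation_minutes
        if covered_until is None or start >= covered_until:
            # No overlap with earlier waves: the whole window is new downtime.
            total += mitigation_minutes
            covered_until = end
        elif end > covered_until:
            # Partial overlap: only count the part that extends the window.
            total += end - covered_until
            covered_until = end
    return total


print(total_disruption_minutes([0, 0, 0], 30))    # all three waves at once
print(total_disruption_minutes([0, 30, 60], 30))  # waves 30 minutes apart
```

Under these assumptions the two calls print 30 and 90, matching the half hour and the hour and a half described above.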
When you encounter an attack vector for the first time, the process is going to be fairly manual. An analyst will need to verify the attack, unique properties will need to be extracted, and runbooks must be followed in order to implement a proper mitigation strategy. However, once you’ve gone through the full process, there is absolutely no reason not to automate future responses. In fact, a key discussion point at each postmortem should be “How can we accelerate our response time if this same attack vector is seen in the future?”
Consider the “three attacks over an hour and a half” scenario discussed earlier. While a half hour is actually a pretty good response time with a manual process, over multiple attack waves this can create a serious impact to the business. It is not hard for an attacker to change their attack vector every 30 minutes. You can easily find yourself in a position where the bad guys are constantly outflanking you. However, by automating the process, you can easily reduce your response time to five minutes or less. This now makes it far more difficult for the attacker to see if their attack vector is having any impact at all. They no longer have a functional block of time during which they can evaluate your points of vulnerability. So automation not only improves your response, it also makes the attack itself harder to carry out for those launching it.
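To put rough numbers on that difference, consider what share of each attack rotation is spent unmitigated. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes the attacker switches vectors on a fixed schedule and that each new vector is fully effective until mitigation lands.

```python
def disrupted_share(response_minutes, rotation_minutes):
    """Fraction of each attack rotation during which the new vector is unmitigated."""
    if rotation_minutes <= 0:
        raise ValueError("rotation_minutes must be positive")
    # The disruption cannot last longer than the rotation itself.
    return min(response_minutes, rotation_minutes) / rotation_minutes


# Attacker changes vector every 30 minutes, as in the scenario above.
manual = disrupted_share(30, 30)     # 30-minute manual response
automated = disrupted_share(5, 30)   # 5-minute automated response

print(f"manual: {manual:.0%}, automated: {automated:.0%}")
```

With a 30-minute manual response the network is disrupted for the entire rotation, while a five-minute automated response cuts that to roughly one sixth of the time.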
So a good DDoS mitigation management system is key to reducing response time. A proper system should have the ability to monitor traffic patterns for those unique characteristics identified by your analysts. The system should then be able to follow a defined set of rules to identify the proper mitigation technique to be used for this specific situation. Finally, the system should be capable of implementing your runbook steps by deploying the proper mitigation response. This reduces your response time as much as possible. It also ensures that your analysts stay focused on emerging threats rather than constantly revisiting old ones.
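The monitor, match, and deploy flow described above can be sketched as a small rule table. This is an illustration under stated assumptions rather than how any particular product works: the `Rule` class, the traffic summary fields (`protocol`, `pps`, `src_port`, `open_connections`), the thresholds, and the mitigation descriptions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # characteristics identified by analysts
    mitigation: str                  # runbook step to deploy when matched


# Hypothetical signatures and thresholds, for illustration only.
RULES = [
    Rule(
        "ICMP flood",
        lambda t: t.get("protocol") == "icmp" and t.get("pps", 0) > 50_000,
        "rate-limit ICMP at the edge",
    ),
    Rule(
        "UDP reflection",
        lambda t: t.get("protocol") == "udp" and t.get("src_port") in {53, 123, 1900},
        "drop UDP from reflected source ports",
    ),
    Rule(
        "TCP connection exhaustion",
        lambda t: t.get("protocol") == "tcp" and t.get("open_connections", 0) > 100_000,
        "enable per-source connection limits",
    ),
]


def select_mitigation(traffic: dict) -> Optional[str]:
    """Return the runbook step for the first matching rule, or None."""
    for rule in RULES:
        if rule.matches(traffic):
            return rule.mitigation
    return None  # unknown pattern: escalate to an analyst


sample = {"protocol": "udp", "src_port": 123, "pps": 1_000}
print(select_mitigation(sample))
```

When no rule matches, the sketch returns `None` so the event can be handed to an analyst. Once that analyst has worked through the new vector manually, it becomes one more entry in the rule table.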
In this series, we discussed the different steps included in the DDoS mitigation management framework. We discussed the importance of proper preparation, as well as performing a postmortem. We also discussed the importance of reducing mitigation time as much as possible, as well as implementing a proper DDoS mitigation management system.