On any given day, a typical networked host will send about 30MB and receive about 200MB, switching about 300,000 packets in the process. During peak times, the average workstation initiates two to four UDP or TCP sessions per second, and each session averages 34KB in size, roughly 100 packets. What’s more, these sessions are negative-exponentially distributed with regard to packet count. What does that mean? It means that very short sessions of only a couple of packets vastly outnumber the lengthy sessions with lots of packets.
When routers use sampling for NetFlow generation, an interesting thing happens. The sampling is done on a packet-count level, so a 1:512 sampling rate will grab roughly every 512th packet to update the flow state tables.
This is great for reducing CPU load. But it is not so great at reducing the flow update rate. Here’s why: With an average session size of roughly 100 packets, each sampled packet is very likely to be part of a flow that is not yet in the state table. This means an entry is created, which will lead to a flow update being sent. Compare this to 1:1 unsampled flow generation, where most of the packets will go toward updating existing entries in the flow state table. Entries in the flow state table are typically exported when a flow is 60 seconds old, or when the table is full and old entries need to be purged.
Leaving the exact math out for clarity, if unsampled flow generation results in a flow rate of X, then 1:512 sampling results in roughly 1/5th of X. Not 1/512th.
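To make the intuition concrete, here is a minimal sketch under simplifying assumptions that are not from any router or FlowTraq implementation: flow sizes follow a geometric distribution with a mean of 100 packets, and each packet is sampled independently with probability 1/512. The function name and the model itself are hypothetical. It counts the fraction of flows that get at least one sampled packet, which is only a rough proxy for the flow update rate.

```python
def sampled_flow_fraction(mean_packets: float, sampling_n: int,
                          max_packets: int = 100_000) -> float:
    """Fraction of flows with at least one sampled packet.

    Assumptions (illustrative only): flow sizes are geometric on
    {1, 2, ...} with the given mean, and every packet is sampled
    independently with probability 1/sampling_n.
    """
    q = 1.0 / mean_packets         # geometric parameter
    miss = 1.0 - 1.0 / sampling_n  # chance a single packet is NOT sampled
    seen = 0.0
    p_size = q                     # P(flow has exactly n packets), n = 1
    p_all_missed = miss            # P(all n packets are missed), n = 1
    for _ in range(max_packets):
        seen += p_size * (1.0 - p_all_missed)
        p_size *= 1.0 - q
        p_all_missed *= miss
    return seen


fraction = sampled_flow_fraction(mean_packets=100, sampling_n=512)
print(f"1:512 sampling still sees about {fraction:.1%} of flows")
print(f"naive expectation would be {1 / 512:.2%}")
```

Under these assumptions the sampled router still creates entries for roughly one-sixth of all flows, which is in the same ballpark as the one-fifth estimate and nowhere near 1/512th. A different traffic mix or a deterministic every-Nth-packet sampler would shift the number.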
That figure is an intuitive estimate; the true results of sampling depend heavily on the precise mix of traffic present on the network. Also, some routers will use adaptive flow sampling rates to keep their flow export rates constant. This means that at busier times, the data becomes coarser and coarser. Although this is nice for CPU time considerations on the router’s end, it does not help that the coarsest data is collected during the heaviest attack!
Consider this when integrating FlowTraq with a DDoS mitigation appliance, such as A10 Networks’ TPS Thunder devices. A saturated 10Gbps link using no sampling will result in roughly 25,000 flow record updates per second (fps), which is the nominal throughput of a single FlowTraq unit. However, that same unit will be able to temporarily handle peak rates 4x to 6x that. You wouldn’t want to run that much traffic through it all the time, but temporary peaks can be handled for minutes or even hours at a time. When planning for very high-volume attacks, the recommended number of FlowTraq units needed to handle detection is usually based on normal expected traffic flows. In contrast, the number of DDoS mitigation appliances is tied directly to the expected maximum traffic volumes. This allows additional capacity for handling very big attacks when they occur. When building in redundancy, one should note that FlowTraq clusters are already internally redundant.
Some customers will collect NetFlow from many more locations in their network than their DDoS mitigation devices will be protecting. Because the same traffic is then seen from multiple vantage points, the flow rate rises, and this should be taken into consideration when choosing the right number of FlowTraq servers.
So based on all these factors, the right way to determine the FlowTraq unit count is to base it on the typical (peace-time) throughput of a customer’s network. But there is a caveat: 10Gbps of DNS requests may quickly become >1Mfps, while 10Gbps of streaming video may only be 5Kfps. A typical network traffic mix at 10Gbps gives about 25Kfps, which can be serviced by a single unit of FlowTraq.
Using a simple ratio of one FlowTraq unit per 10Gbps provides a quick estimate. Note that the maximum throughput of a DDoS appliance is not particularly relevant to the recommended number of FlowTraq units. Unfortunately, the true number will ultimately depend on the number of flow updates coming in, which is something we generally learn during a proof-of-concept phase, and sometimes after the system goes live with real network traffic. (Fortunately, FlowTraq can easily scale up to handle larger volumes if needed.)
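The rule of thumb can be written down as a back-of-the-envelope calculation. The sketch below is illustrative only: the function and constant names are hypothetical, the per-mix flow rates are the rough figures quoted above, and the vantage-point multiplier assumes the worst case in which every collection point sees all of the traffic.

```python
import math

# Nominal throughput of one FlowTraq unit, in flow updates per second.
UNIT_NOMINAL_FPS = 25_000

# Rough flow updates per second produced by 10Gbps of each traffic mix.
FPS_PER_10GBPS = {
    "typical": 25_000,
    "dns": 1_000_000,  # quoted as ">1Mfps", so treat this as a lower bound
    "video": 5_000,
}


def estimate_units(gbps: float, mix: str = "typical",
                   vantage_points: int = 1) -> int:
    """Quick estimate of the unit count for peace-time throughput."""
    fps = gbps / 10 * FPS_PER_10GBPS[mix] * vantage_points
    return max(1, math.ceil(fps / UNIT_NOMINAL_FPS))


print(estimate_units(40))                    # typical mix on 40Gbps
print(estimate_units(10, mix="dns"))         # DNS-heavy 10Gbps link
print(estimate_units(10, vantage_points=2))  # same traffic seen twice
```

An estimate like this is only a starting point; the real flow update rate is best measured during a proof of concept.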
Sampling is simply not the correct approach to reducing cost or gaining the best visibility. I encourage customers to design for 1:1 unsampled flow, because it builds in a safety margin during the biggest of attacks, and it often turns out that the total FlowTraq unit count is quite reasonable and cost-effective.
Have questions? Want to know more? Reach out to us at firstname.lastname@example.org.