Most networking in the AWS VPC cloud is familiar and intuitive to network engineering veterans. And if you think about it, that’s kind of a miracle. All you have to do is pick a region, declare a VPC, and suddenly you have an internal IP space, Elastic IPs for communicating with the world, and virtual routers, firewalls, NAT devices, and more.
All these abstractions behave uncannily like the brick-and-mortar hardware they’re modeling, making them easy to reason about and design with. And interaction with them is neatly packaged up as APIs, command-line tools, and a mostly decent user interface, enabling network engineers to forget many of the arcane details they used to have to worry about (“how do I configure a VLAN on a Nexus 5000?”).
The black boxes are so nice and shiny that it’s easy to forget all the engineering under the hood. The trouble with black boxes, though, is that sometimes what’s inside surprises you. And sometimes you need a little more control than the knobs, switches, and ports on the outside of the box.
This is especially true in network security and visibility, where 1) your threat detection platforms and analysis tools absolutely need real-time, forensically-accurate network data, and 2) your analysts need access to first-class tools across the whole network, especially the stuff in the cloud.
Recognizing those two needs, about a year ago I released a free AWS Lambda tool to convert and forward CloudWatch flow logs to NetFlow v5. My honest hope was that people would use it to forward their VPC traffic from AWS to their favorite flow analysis platform and use that to do some good in the world (and perhaps that some of them might discover FlowTraq and find a new favorite).
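The conversion at the heart of a tool like that can be sketched as follows. This is a minimal illustration, not the released tool’s actual code: it parses one default-format (version 2) VPC flow log record and packs it into a single-record NetFlow v5 datagram. The timestamp handling is a rough approximation, since NetFlow v5 expects milliseconds of router uptime rather than epoch seconds.

```python
import socket
import struct

def parse_flow_log(line):
    """Parse a default-format (version 2) VPC flow log record into a dict."""
    f = line.split()
    return {
        "srcaddr": f[3], "dstaddr": f[4],
        "srcport": int(f[5]), "dstport": int(f[6]),
        "proto": int(f[7]), "packets": int(f[8]),
        "octets": int(f[9]), "start": int(f[10]), "end": int(f[11]),
    }

def to_netflow_v5(flow, sequence=0):
    """Pack one flow into a NetFlow v5 datagram: 24-byte header + 48-byte record."""
    header = struct.pack(
        "!HHIIIIBBH",
        5,            # version
        1,            # record count
        0,            # SysUptime (unknown for a synthetic exporter)
        flow["end"],  # export time, epoch seconds
        0,            # residual nanoseconds
        sequence,     # flow sequence counter
        0, 0, 0)      # engine type, engine id, sampling interval
    record = struct.pack(
        "!4s4s4sHHIIIIHHBBBBHHBBH",
        socket.inet_aton(flow["srcaddr"]),
        socket.inet_aton(flow["dstaddr"]),
        socket.inet_aton("0.0.0.0"),        # next hop: unknown
        0, 0,                               # input/output ifIndex
        flow["packets"], flow["octets"],
        flow["start"] * 1000 & 0xFFFFFFFF,  # first seen, ms (approximation)
        flow["end"] * 1000 & 0xFFFFFFFF,    # last seen, ms (approximation)
        flow["srcport"], flow["dstport"],
        0,                                  # pad
        0,                                  # TCP flags (not in v2 flow logs)
        flow["proto"], 0,                   # protocol, ToS
        0, 0, 0, 0, 0)                      # AS numbers, masks, pad
    return header + record
```

The resulting datagram can be sent over UDP to any NetFlow v5 collector. Notice how many record fields have to be zero-filled: flow logs simply don’t carry TCP flags, next-hop, or interface indexes, which is one more way the abstraction loses fidelity compared to flow generated at a real router.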
Based on customer feedback, the biggest issue with this approach is the reliance on CloudWatch logging. CloudWatch flow logging is an AWS abstraction with some pretty serious limitations, the biggest one being that AWS only promises flow logs every ten minutes or so, and even then the promise isn’t a strong one.
Ten minutes’ lag time before detecting a DDoS attack or other security incident isn’t that far from not detecting it at all. It became clear we needed to take another approach to the AWS VPC security problem.
(Incidentally, while Google was late to the party, having just rolled out their version of VPC flow logs a few weeks ago, they really showed Amazon up here, with flow updates every five seconds. I plan to discuss their offering in a future post.)
If you’re committed to the AWS ecosystem, where do you go from here? Well, in a traditional network environment, if your existing hardware doesn’t support flow generation, you have a couple options: 1) drop in a network tap and feed the mirrored traffic to a flow-generating probe, or 2) insert a transparent bridge inline and generate flow there.
Unfortunately, there’s no abstraction for a tap in the AWS VPC. So the first option is out.
As for the second option: it turns out you can’t build a working bridge in AWS, either. That marvelous simplicity-in-complexity that is AWS that I was referring to earlier? It means that some networking concepts don’t translate. In this case the concept that isn’t a perfect match for on-premises networking is ARP.
In an AWS VPC, when host “A” ARPs for host “B”, the response doesn’t come from “B”. In fact, the initial request never even arrives at “B”. It’s captured and handled by something called the “AWS mapping service”. Simply put, there isn’t a traditional broadcast domain in AWS at all. No, not even between machines on the same VPC subnet!
(Want to learn more about the mapping service and other AWS engineering marvels? Check out Eric Brandwine’s talk “A Day in the Life of a Billion Packets” at AWS re:Invent 2013. It’s fascinating stuff!)
What this means is, to get real-time, forensically accurate, reliable flow out of AWS VPCs we actually need to roll our own router instance. Yes, really. It’s not as bad as it sounds, and it will start paying dividends in networking security right away.
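The basic setup follows the same pattern as AWS’s documented NAT instance (linked below), with a flow exporter added on top. A rough sketch, assuming an Amazon Linux instance whose VPC-facing interface is eth0; the instance ID and collector address are placeholders, and softflowd is just one of several flow exporters that would work here:

```shell
# Disable the EC2 source/destination check so the instance may forward
# traffic it neither originated nor is addressed to. Required for any
# NAT/router instance. (Placeholder instance ID.)
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --no-source-dest-check

# Enable IP forwarding in the kernel.
sudo sysctl -w net.ipv4.ip_forward=1

# Masquerade outbound traffic, as on a stock NAT instance.
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# With the VPC subnet route tables pointing 0.0.0.0/0 at this instance,
# every flow crosses eth0, where an exporter can watch it and emit
# NetFlow v5 in real time. (Placeholder collector address.)
sudo softflowd -i eth0 -n 10.0.0.50:2055 -v 5
```

Because the exporter sees the packets themselves rather than a delayed log abstraction, the flow data is as timely and forensically accurate as what you’d get from a hardware probe on-premises.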
Credit to: https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_NAT_Instance.html