

Working With VPC Flow Logs

By Chris Brenton | August 19, 2017



In this article, I’ll take a deep dive into Amazon’s Virtual Private Cloud (VPC) flow logs. I’ll discuss how to interpret the data they produce, as well as how to find security-related incidents. Finally, I’ll provide some advice on automating the review process.

Flow Log Setup

There is already some great advice out there on how to configure your AWS environment to generate flow log data. The Amazon User Guide is a great place to start. Alex Barsamian has a great article on configuration, as well as on importing flow logs into FlowTraq. Rather than repeating the same steps, I’ll jump right into analyzing the data.

Reading VPC Flow Logs

Flow logs can be created for specific system interfaces or for entire VPCs or subnets. They provide a level of detail similar to NetFlow- or IPFIX-compatible systems, although Amazon does not precisely follow either of these standards.

Figure 1: Sample Flow Log data

Figure 1 shows an example of some flow log data. Each row represents a unique communication flow that was recorded. From left to right, here is a description of each of the fields; note that the fields are space-separated. A minimal parsing sketch follows the list.

  • Timestamp in UTC of when this flow log entry was opened
  • VPC flow log version number
  • AWS account ID associated with the logs (blacked out in this example except for the last three digits)
  • The network interface ID of the instance associated with the traffic
  • The source IP address
  • The destination IP address
  • Source port when applicable (UDP and TCP traffic)
  • Destination port when applicable (UDP and TCP traffic)
  • Protocol being used (1=ICMP, 6=TCP, 17=UDP, etc.)
  • The number of packets transmitted in this flow
  • The number of bytes transmitted in this flow (not including frame information)
  • The time the first packet was seen, expressed in Unix time
  • The time the last packet finished transmitting expressed in Unix time
  • The action taken on the packet based on security group or ACL settings
  • Status of attempting to record this flow (used for troubleshooting)
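
To make these fields easier to work with programmatically, here is a minimal parsing sketch in Python. It assumes the space-separated field order described above (a leading timestamp, as shown in the CloudWatch console, followed by the fourteen standard fields); the sample line uses hypothetical values, with the blacked-out account digits replaced.

```python
from collections import namedtuple

# Field order as described above: a leading timestamp followed by the
# fourteen standard VPC flow log fields.
FlowRecord = namedtuple("FlowRecord", [
    "timestamp", "version", "account_id", "interface_id",
    "src_addr", "dst_addr", "src_port", "dst_port", "protocol",
    "packets", "bytes", "start", "end", "action", "log_status",
])

def _to_int(value):
    # ICMP rows use "-" for the port fields.
    return None if value == "-" else int(value)

def parse_flow_line(line):
    """Split one space-separated flow log line into a FlowRecord."""
    rec = FlowRecord(*line.split())
    return rec._replace(
        src_port=_to_int(rec.src_port), dst_port=_to_int(rec.dst_port),
        protocol=int(rec.protocol), packets=int(rec.packets),
        bytes=int(rec.bytes), start=int(rec.start), end=int(rec.end),
    )

# Hypothetical line matching the first row discussed below:
sample = ("13:10.03 2 123456789315 eni-768d1092 39.88.118.15 172.31.8.226 "
          "42608 23 6 2 80 1494594603 1494594657 REJECT OK")
print(parse_flow_line(sample))
```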

Interpreting VPC Flow Logs

Let’s walk through an example by looking at the first row of data in Figure 1. “13:10.03” tells us the log entry was created at approximately 1:10 PM and 3 seconds UTC. The current flow log version being used is “2,” and “315” are the last three digits of the AWS account used in this example (prior digits are blacked out). “eni-768d1092” is the network interface ID of the system where these packets were sent or received. To see which system uses this interface, log in to your AWS account and go to the EC2 dashboard. In the left-hand menu, select “Network Interfaces” and search for the network interface identified in the log entry.

The next field, “39.88.118.15,” is the source IP address that is sending the traffic. In this case an IPv4 address is being displayed. IPv6 addresses are also supported. Note that flow logs do not clearly identify inbound and outbound traffic direction. I need to know that this is not an IP address that I am using in order to identify that this traffic is headed inbound from the Internet. The next field, “172.31.8.226,” is the IP address of the receiving system. Since this is a private IP address, this must be our cloud instance.

The next value, “42608,” identifies the source port being used by the transmitting system. The next value, “23,” identifies the port expected to receive this traffic. TCP/23 is Telnet traffic. If I do not recognize the source IP address, this is most likely someone probing to see if they can find a Telnet server to exploit. The next value, “6,” confirms that this is a TCP packet.

Our next two fields are where things get interesting. We see that this flow included “2” packets, for a total of “80” bytes of information. Let’s assume the source IP transmitted a packet, and when no response was received, the packet was retransmitted. This would mean both packets would be nearly identical, so it would be safe to assume that each packet was 40 bytes in size (80/2=40). Both the IP and TCP headers are 20 bytes in size when no options are set. While it is normal for a system to not set any IP options when transmitting a packet, it is unheard of for a modern operating system to not use one or more TCP options. When TCP options are set, the TCP header grows in size. Since no TCP options are set in these packets, they must have originated from some kind of security tool like a port scanner. So if we don’t recognize the source IP, and that system is running a port scanner, it is most likely hostile.
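
This reasoning can be turned into a simple heuristic. The sketch below, building on the hypothetical FlowRecord parser above, flags TCP flows whose average packet size is exactly the bare 40-byte header; real client stacks normally send SYNs of roughly 52 to 60 bytes because of TCP options.

```python
def looks_like_bare_probe(rec):
    """Flag TCP flows whose packets average exactly 40 bytes: a 20-byte IP
    header plus a 20-byte TCP header with no options, typical of scanners."""
    if rec.protocol != 6 or rec.packets == 0:
        return False
    return rec.bytes / rec.packets == 40
```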

The next two fields are the start and stop time of the network traffic expressed in Unix time. This is the number of seconds that have elapsed since January 1st, 1970. If you need to convert these values to normal date/time stamps, there are online converters, or you can write your own. For example, the start time translates to May 12th, 2017 at 1:10 PM and 3 seconds UTC. The traffic stopped at May 12th, 2017 at 1:10 PM and 57 seconds UTC.
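
In Python the conversion is a one-liner; the epoch values below are the ones that correspond to the start and stop times just described.

```python
from datetime import datetime, timezone

start, end = 1494594603, 1494594657  # epoch seconds from the flow record

print(datetime.fromtimestamp(start, tz=timezone.utc))  # 2017-05-12 13:10:03+00:00
print(datetime.fromtimestamp(end, tz=timezone.utc))    # 2017-05-12 13:10:57+00:00
```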

The next field, “REJECT,” tells us how the packet was processed. You can use Amazon security groups and network ACLs to define which traffic patterns you wish to permit through. Which packets are permitted will be based on an intersection of these two security features. In other words, the traffic pattern must be permitted by both in order to reach your instance.

It is worth noting that Amazon does not use “REJECT” in the classic firewall sense. With firewalls, you can typically apply one of three actions on a packet stream:

  • Permit = Let the traffic through
  • Drop = Quietly remove the packet from the network
  • Reject = Remove the packet from the network and return an administratively prohibited error

When Amazon describes a flow as being rejected, what they actually mean is that the packets have been dropped. This is important when you are troubleshooting inbound connection failures as blocked packets will not return an error code. This is also important if you are controlling outbound sessions as blocked traffic patterns will take longer to timeout, and thus use more CPU and memory.

Flow Log Caveats

Flow logs do not record every communication flow. Amazon specifically ignores certain traffic patterns. These include:

  • Traffic going to and from Amazon DNS servers
  • Traffic associated with Windows license activation
  • DHCP traffic
  • Traffic exchanges with the default VPC router

The IP address in the flow log entry may not reflect the IP address in the actual packet. For example, in the first row in Figure 1 the destination IP address is recorded as “172.31.8.226.” Since this is a private address, it is not possible for the Internet-based source IP address in this entry to send traffic to this private IP address and have it arrive at our instance. So our instance must have a public IP address associated with it, and that public address is the real destination IP address that was recorded in the packet. However, flow logs convert all entries to the primary private IP address of the interface.

This is true for traffic in both directions. Note that in row five the source IP address is listed as “172.31.8.226.” It would be impossible for a host on the Internet to respond to this private address, so the actual packet must carry the instance’s public IP address. Again, we are simply being shown the primary private IP address associated with this interface.

Flows Versus Sessions

A session typically refers to one complete information exchange between two systems. For example, a “TCP Session” would include all packets in both directions from the initial SYN (connection establishment) to the final FIN/ACK (connection teardown).

A flow is a subset of a session and describes some number of packets moving in one direction.

For example, refer back to Figure 1 and look at row seven. Here we see 6 packets moving from source port TCP/24901 on 73.47.160.75 to destination port TCP/80 on 172.31.8.226. This would be described as a single flow. If you jump to row ten, you’ll see 5 packets moving from source port TCP/80 on 172.31.8.226 to destination port TCP/24901 on 73.47.160.75. Combine these two flows together, and they describe a single session between the two systems.

If you keep looking through the data, you may notice that in row 13 there is another flow between these two systems. While you may initially think that this flow is also part of the above session, note the source and destination ports being used. In row 13, we see 5 packets moving from source port TCP/80 to destination port TCP/24925. Since at least one of the TCP ports being used has changed, this flow is part of a different session. Its complementary flow, 6 packets moving from source port TCP/24925 on 73.47.160.75 to destination port TCP/80 on 172.31.8.226, appears elsewhere in the data and is also part of this second session.

When you are looking at a single flow entry, you are only looking at the packets that move in one direction. You need to find the complementary flow to account for all of the packets used in the session.

To account for a full session, you need to look at both the IP addresses and the ports being used to communicate.
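
One way to pair complementary flows is to build a direction-independent key from the protocol, the two IP addresses, and the two ports. Here is a minimal sketch, again assuming the hypothetical FlowRecord objects from the parser above; summing packets and bytes across a bucket gives the totals for the session.

```python
from collections import defaultdict

def session_key(rec):
    """Sort the two (address, port) endpoints so that both directions of a
    TCP or UDP exchange produce the same key."""
    endpoints = sorted([(rec.src_addr, rec.src_port), (rec.dst_addr, rec.dst_port)])
    return (rec.protocol,) + tuple(endpoints)

def group_sessions(records):
    sessions = defaultdict(list)
    for rec in records:
        sessions[session_key(rec)].append(rec)
    return sessions  # each bucket holds the flow(s) belonging to one session
```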

Analyzing Flow Logs

While flow logs are not very granular, they can be used to test for a number of different security conditions. Here are some possibilities:

  • Monitor for outbound SSH and HTTPS traffic. Either could be an indication of a compromised system that is attempting to call home or install additional software.
  • Monitor for inbound ICMP error packets. A host receiving an excessive number of ICMP errors could be an indication that it is compromised and scanning the Internet.
  • Monitor for an excessive number of unique sessions compared to other source IPs. An attacker who is trying to guess your SSH passwords, or scan your Web site for vulnerabilities, will typically generate more unique sessions than what is normal for that service.
  • Monitor for excessive amounts of session data compared to other sessions. If your system has become compromised, and an attacker is transferring your databases and files, they will typically transfer larger byte counts than is normal for a single IP address.
  • Monitor for large increases in traffic to a specific IP or service. This could be an indication that a denial of service attack has just been launched.

In short, you should become familiar with how your instances normally operate, and define this as your operational baseline. You can then monitor for changes in this baseline in order to prompt deeper investigations.
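
As an illustration of the baseline idea, the sketch below counts unique sessions per source IP toward one service and flags sources that exceed a threshold you have established as normal. The field names follow the hypothetical parser above, and the threshold of 20 is purely illustrative; running the same idea against byte counts instead of session counts covers the data-exfiltration case.

```python
from collections import Counter

def flag_noisy_sources(records, dst_port, baseline_sessions=20):
    """Flag source IPs that open more unique sessions to one service than
    your established baseline (e.g. SSH password guessing on port 22)."""
    seen = set()
    sessions_per_src = Counter()
    for rec in records:
        if rec.dst_port != dst_port:
            continue
        key = (rec.src_addr, rec.src_port)  # one client socket per session
        if key not in seen:
            seen.add(key)
            sessions_per_src[rec.src_addr] += 1
    return {ip: n for ip, n in sessions_per_src.items() if n > baseline_sessions}
```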

Manipulating Flow Log Data

In order to extract value from your flow log data, you are going to need to be able to manipulate it. You should be able to search it for specific values, create filters, combine and summarize multiple flows, and create threshold alerts. You have three possible options: work with the Amazon-supplied tools, build your own system, or buy a turnkey solution.

Amazon Flow Log Tools

Flow logs can be manipulated directly within the Amazon graphical interface via CloudWatch. Once you log in, simply go to the CloudWatch console. You can select the log group you wish to review and view the log entries via the console shown in Figure 2.

Figure 2: The CloudWatch Console can be used to view log entries and perform simple searches

While the search capability is pretty rudimentary, it is sufficient for finding simple patterns in the data. Note that the process can be pretty labor-intensive. CloudWatch also lets you set filters and alerts to trigger on specific events. These can help reduce the amount of data you must sort through, as well as notify you when a predefined event takes place. However, this capability is also pretty rudimentary. For example, I can set an alert to warn me if five connection attempts occur within a defined period of time to my SSH server, but I cannot say I only want an alert if all five come from the same IP address. This can make defining a proper threshold problematic. My goal is to trigger an alert if someone is trying to brute-force my SSH server. However, since I don’t have the granularity to group events by IP, I may see false positive alerts if multiple administrators are working on the system. So using Amazon’s tools is probably the least preferred option.
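
If you prefer to script these searches rather than click through the console, the same flow logs can be queried with boto3. This is a sketch under assumptions: it supposes the flow logs are delivered to a CloudWatch Logs group named “my-vpc-flow-logs” (a hypothetical name) and uses a space-delimited filter pattern to match rejected traffic to TCP/22.

```python
import boto3

logs = boto3.client("logs")

# Space-delimited filter pattern: the bracketed names are positional labels
# for the 14 flow log fields; here we match rejected traffic to port 22.
pattern = ('[version, account, eni, source, destination, srcport, '
           'destport="22", protocol="6", packets, bytes, start, end, '
           'action="REJECT", status]')

response = logs.filter_log_events(
    logGroupName="my-vpc-flow-logs",  # hypothetical log group name
    filterPattern=pattern,
)
for event in response["events"]:
    print(event["message"])
```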

Build Your Own System

Another solution is to build your own logging system and export the flow logs into this system. While many environments already have a centralized logging solution, you should consider separating flow logs and/or firewall logs out to their own system. Networks see a lot of packets, which can generate a huge number of flows or log entries on a regular basis. Keeping this system separate makes it easier to manage.

Building your own system can seem attractive, as you can get started immediately and for very little cost. For example, a relatively robust system can be built on Linux using Elasticsearch, Logstash and Kibana, typically referred to as the ELK Stack. However, what you are getting is a starting framework. You still need to decide how you will configure the system to monitor for suspect activity. This portion can take many hours. Further, these setups tend to be quite tribal: there is usually only one person, possibly two, who understands the full system, so time off or a departure can leave a company hard pressed to manage it.

If you are a very small company, or simply need to address personal use, an ELK stack or similar may be the way to go. However if you are an organization that is looking to mature and grow, you should consider a turnkey solution that includes support.

Turnkey Solution

FlowTraq provides a turnkey solution that is capable of processing flow data once it has been properly converted. As mentioned earlier in this article, Alex Barsamian has a great write-up on importing flow logs into FlowTraq.

PCI Compliance: A Step by Step Guide To PCI in Public Cloud

By Chris Brenton | July 7, 2017



As a security consultant, one of the biggest misconceptions I see is companies that process credit cards but feel they do not have to meet the Payment Card Industry Data Security Standard (PCI DSS). This risky assumption gets exacerbated with the adoption of public cloud. I’ve seen quite a few organizations that think they have “outsourced” that responsibility to someone else. To be clear, if you are accepting credit cards from your customers, it is 99.9% certain you have some level of culpability in meeting the PCI DSS requirements. In this series of blog posts I will walk you through assessing your exposure, reducing it when possible, and addressing controls that can be a bit of a challenge to comply with in a public cloud environment.

——————————————–

Executive Summary

This blog post is intended to be a step by step guide in achieving compliance with the Payment Card Industry Data Security Standard (PCI DSS) in a public cloud environment. We will walk through where to even start and how to assess which controls are applicable to your organization. We will cover how you can reduce the scope of your attestation. From there we will move into how to perform a gap analysis on your public cloud provider’s environment in order to ensure all of the PCI DSS controls are in place. Finally, we will look at how to solve some of the more difficult controls which revolve around implementing network based intrusion detection within a public cloud environment.

——————————————-

The Risks of Noncompliance with PCI DSS

Experiencing a credit card incident can feel like falling off a cliff. One day it is business as usual and the next you could have a clear line of sight to bankruptcy. The repercussions can be financially devastating. Individual banks and credit card institutions may each begin fining you $5,000 to $100,000 a month for non-compliance. An increasing scale is used, so the longer the problem exists the greater the fines. Card institutions may even choose to no longer let you accept credit cards for payment. If a data breach is involved, there may be additional fines levied on a per transaction or per customer basis. If the event is made public, you will also have to deal with negative branding and the erosion of customer trust.

Note that this experience is very different from most infrastructure problems. For example, if a server runs out of storage or a Web site starts performing slowly, a degradation is experienced that gives you time to react. You can delete files to free up storage, which buys you time to increase storage capacity. You can put proxies or a Content Delivery Network (CDN) in front of the Web server, giving you time to upgrade the hardware. Credit card incidents are very different in that once the problem becomes obvious you are already on a very destructive trajectory. This is why it is so important to get compliance right from the start.

PCI Compliance – Where to Start

The first step in achieving compliance is understanding exactly how you are processing credit cards. Details matter here, so do not assume or guess. You will want security, development and operations expertise involved in this process. At a minimum, you should be able to answer each of these basic questions:

  • How many credit cards do we process on an annual basis?
  • How do we render the page where our customers input their credit card info? Is that a static page stored on our servers, a script that renders the page but submits the results to a third party credit card processor, or an iFrame that displays a page controlled by a third party credit card processor?
  • When the user hits “Submit”, where does that credit card data go? Is it processed by a script or application running on one of our servers, or does it go straight to a credit card processor?
  • When a transaction is performed, what data is returned by the credit card processor? Does it include the credit card number?
  • Do we store credit card numbers or tokens in our database?
  • How are we performing fraud detection? Can these individuals see raw credit card numbers?
  • Who internally can see raw credit card numbers? Engineering? Finance? Customer Support?

Knowing how many credit cards you process on an annual basis will allow you to determine if you require an external audit for compliance. If you are generating six million credit card transactions or more on an annual basis, you need to contract a Qualified Security Assessor (QSA). If you are below six million, you can perform a self-assessment by completing a Self Assessment Questionnaire (SAQ). While a self-assessment is obviously cheaper, it is typically performed by less experienced individuals. This may mean that while you think you are compliant, you are actually not. Again, this may not be made apparent until some form of credit card event occurs.

The rest of the questions help identify the “scope” of the assessment. For example if we are storing raw credit card numbers in a database, then that database, the server it runs on, the network the server connects to, and all of the procedures used to maintain each can be considered “in scope” and required to meet all related PCI controls. If however we are storing tokens, then the database, the server and the network may be considered “out of scope” and thus require less scrutiny.


Determining Your Initial PCI Scope

Now that we understand the systems and processes responsible for processing credit cards, we need to determine which Self Assessment Questionnaire (SAQ) is applicable to your environment. As of the time of this writing, the current version of PCI is 3.2. You may wish to check the PCI Security Council Web site to see if a newer version has been released. The chart in Figure 1 is from the PCI SAQ Instructions and Guidelines documentation. This identifies which SAQ documentation is applicable to your organization, depending on how you process credit card information.

Figure 1: This chart can be used to identify which PCI SAQ documentation is appropriate for your organization.

 

If you are an e-commerce site, here is what you need to know:

  • If you have completely outsourced credit card processing to an external vendor, and you have implemented iFrames or similar so that your code has no direct impact on credit card processing, you can use “SAQ A” to identify your compliance.
  • If you have completely outsourced credit card processing to an external vendor, but your Web site can potentially impact the integrity of the transaction (example: your Web site serves up a javascript that renders the payment page within the customer’s browser but the results are sent directly to your credit card processor), you can use “SAQ A-EP” to identify your compliance.
  • If none of the listed SAQs describe your setup, you must use “SAQ D” to identify your compliance.
  • If you are processing six million credit cards a year or more, regardless of whether one of the listed SAQs describes your setup, an external auditor will check you against “SAQ D”.

Reducing PCI Scope

Our scope is going to define how many PCI controls we are required to meet, as well as which systems and processes must adhere to these controls. This can have a huge impact on the amount of work required to achieve PCI compliance. For example, assume we have outsourced credit card processing, but still store credit card information on servers located within an onsite data center. We would need to comply with SAQ D, which describes over 360 different security controls. This means we would need to be able to produce evidence that all of these controls are in place and compliant. These controls impact everything from physical security all the way up to security testing of our software.

Now let’s assume we decide to change our workflow such that all credit cards are processed and stored by our upstream processor. We will no longer store any credit card information onsite, but instead use tokens issued by our processor. In this case we may qualify to use SAQ A-EP for our assessment. This SAQ only describes about 140 controls. So just by changing where we store credit cards, we can considerably reduce the amount of work in achieving compliance. If we further stop rendering the payment page, and use an iFrame to let our credit card processor render the page for us, we may qualify to use SAQ A for our assessment. This will reduce the number of controls to 22. So one of the best ways to reduce the amount of work involved with achieving PCI compliance is to reduce our interaction with any credit card information.

Public Cloud’s Impact on PCI

Another method of reducing scope is to essentially outsource PCI responsibility to a third party vendor. This method is a bit controversial as it has only been an option for about five years. Some may argue that “you can’t outsource liability for protecting credit cards”. The argument is that if I entrust a third party with my customer’s credit cards, and that third party is compromised, I could potentially share in the liability for that breach. While the argument is technically true, from a practical perspective PCI immediately becomes irrelevant if this is ever enforced.

The PCI Council has built an entire infrastructure around supporting organizations that need to process credit cards. They keep track of individuals and companies that are certified to perform audits, process credit card information and perform security scans. They even list approved software and hardware products. The entire system is built on trust, in that you are okay so long as you are working with a PCI-verified vendor. This is why one of the questions asked in the SAQs is:

“Merchant has confirmed that all third party(s) handling storage, processing, and/or transmission of cardholder data are PCI DSS compliant;”

So what happens if that trust breaks down? For example let’s say a cloud provider who has properly received a PCI attestation as a service provider is compromised, and their customers are also found to be liable. If paying a premium to work with a PCI listed service provider yields no reduction in liability or risk over working with a non-compliant vendor, why spend the extra money? If organizations stop caring about PCI compliance in their service providers, why should the service provider invest the time and money into becoming compliant? See where this is going? This becomes a slippery slope into PCI no longer providing any value. So the bottom line is outsourcing PCI responsibility to public cloud vendors is fine, provided you do it properly.

Using Public Cloud to Reduce PCI Scope

When you outsource systems or processes to public cloud vendors, you essentially set up a shared responsibility model for meeting the security controls defined by PCI DSS. How much of that responsibility falls on you versus your vendor depends on the deployment model the vendor is using. Figure 2 shows the three cloud service models and roughly the point of responsibility delineation in each model.

Figure 2: The delineation of responsibility between tenant and provider in each of the three cloud service models.

As an example, PCI defines a number of physical security controls. It also identifies how network and server hardware is to be secured. If you locate your in scope servers within a room at your office, you are responsible for meeting each of these controls on your own. However, if you are working with a cloud vendor, they would assume the responsibility for those controls, thus reducing the work you need to perform in order to be compliant.

Note that even with the SaaS model you cannot completely outsource all of your responsibility for PCI compliance. There will always be some number of controls for which you need to assume responsibility.

Also note that the line of delineation is a bit fluid when it comes to PCI. For example, in Figure 2 above, IaaS cloud providers are responsible for everything from the hypervisor on down. PCI DSS control 11.4.a requires that a network-based intrusion detection system be used. Most IaaS cloud providers will not assume responsibility for this control. So even though the “Network” layer is clearly below the line in provider territory, you will need to implement that control on your own. Later, I’ll provide some tips on how to meet this specific control within a public cloud.

The Right Way to be PCI Compliant in the Cloud

So let’s assume you want to leverage a public cloud to reduce the amount of work involved in meeting all of the PCI controls. You’ve identified the systems and processes that are involved with processing your customer’s credit cards. You would like to move these to a public cloud environment in order to shift responsibility for some of the controls to the provider.

You first need to check to ensure the cloud provider is PCI DSS compliant. You can usually find this information on one of their sales or security pages. PCI DSS compliance is a selling point, so the provider will use compliance with this attestation as a marketing tool. Be careful here, as the vendor’s PCI attestation may not apply to their entire environment. For example Amazon has achieved a PCI DSS attestation for their AWS EC2 environment. However at the time of this writing, they do not yet have one for their AWS Lambda environment. So to ensure you are compliant, all of your resources would need to be running within EC2.

You will need to ask your vendor for two documents. The first is their PCI DSS attestation. Make sure their Attestation of Compliance (AOC) is as a service provider, not a merchant. These are two different documents with two different use cases. The second document is their Report on Compliance (ROC). This documents the provider’s compliance status for each PCI DSS Requirement. Some vendors may also include a scope and responsibility document which is essentially an easier to read version of the ROC.

You can typically get these documents through the provider’s sales or support personnel. It is not uncommon for a vendor to first insist that you be a customer in good standing and that you sign an NDA prior to them releasing the documentation. This is a common practice as the documents include sensitive security information.

First, check the date on the PCI DSS attestation. Attestations are good for one year. In fact, you may want to set a reminder to talk to them annually to ensure you always have their latest attestation on file. Next, you will need to perform a gap analysis on the ROC, and on the scope and responsibility documentation if one was provided. These documents will define which PCI controls the vendor is accepting responsibility for, and which controls will be the responsibility of their customers.

For an example, please see Figure 3. Note that the specific PCI DSS control will be referenced, as well as a brief description of the control requirement. The provider will also include a description of whether they are assuming full responsibility for the control, or if the customer maintains some level of responsibility. For example in the Figure the provider is specifying that while they will provide a compliant firewall, the customer is responsible for deployment and management of the firewall.

Figure 3: PCI scope and responsibility example. This document identifies the PCI controls for which the vendor will accept responsibility, and which controls must be maintained by the customer.

Once you complete your gap analysis, you need to implement each of the controls not covered by your vendor.

PCI Compliant Network IDS in Public Cloud

Earlier I mentioned that PCI DSS control 11.4.a specifies that a network-based intrusion detection system be in place. Controls 10.6.1 and all of 11.4 do so as well. Further, controls 1.1.3, 12.3.3, 12.5.2 and many of the 10.x controls can also be satisfied by a device that monitors network traffic flow. However, we have a bit of a quandary. In a public IaaS environment, the provider manages the network layer. This makes it extremely difficult to monitor packets on their network. There is no public cloud equivalent to SPAN or mirror ports. While we could route all traffic through a dual-homed server running network-based intrusion detection software, this introduces latency and causes availability concerns.

One of the easiest solutions is to simply process any network flow data the vendor makes available. For example, Amazon makes VPC flow logs available to their EC2 customers. While these are recorded in a non-standard format, it is possible to convert them so the data can be easily monitored. Alex Barsamian has written a great article on how to retrieve and convert VPC flow logs to IPFIX format. However, there are a number of ways to attack this problem, which I will cover in my next blog entry.

Struggling to comply with controls in the PCI DSS framework? FlowTraq can help!

 


 

Defending Against The Next Round of Ransomware

By Chris Brenton | May 23, 2017



The pace at which new malware gets released into the wild is staggering. While rates have been decreasing over the last two years, we still see 12 or more new variants per hour detected in the wild. Ransomware, currently one of the most pervasive variants, is estimated to be infecting 4,000 systems per day. WannaCry, the latest ransomware variant, is reported to have disabled over 200,000 systems worldwide. While the technology used in the attack changes over time, the initial attack vector, “phishing”, has remained consistent. Phishing is when an attacker attempts to fool the recipient into clicking a link, running an application or handing out their authentication credentials. It can be extremely successful, which is why it is leveraged in three quarters of all attacks. While some malware variations include a worm component, this is only useful after the malware has breached your firewall by getting that first user to infect their own system.

In this article I want to address the social components of that initial phishing attack vector. I’ll talk about why phishing (and malware attacks in general) cannot be fixed simply with technology. I’ll also discuss why most security awareness programs fail at modifying user behaviour, and how you can hack your existing security processes to obtain measurable improvements in reducing your risk to phishing, as well as end user malware infections.

Solving Phishing With Technology

It is only natural that we would try and solve end user infections with technology. For example whenever a mass infection occurs we are told to patch our systems, fortify our perimeter and run malware detection software. While these are all great best practices, none of them address the root cause of the infection, which is a user behaviour issue. As an analogy, think about lane departure and collision avoidance systems that are now being added to modern automobiles. These are great technology improvements that are designed to increase driver safety. However put an operator behind the wheel that makes bad decisions, and eventually they are going to cause an accident. Technology can try and keep a driver from making bad choices, but if they are persistent, the worst is going to occur.

Another problem with the technology approach is that patching only works when the attack vector is known and has been resolved with a patch. In the past we have seen 0-day attacks where a patch was not available to resolve the attack vector. So while we should certainly apply security patches, we cannot rely on them to always keep our systems safe.

There is also an “ownership of responsibility” issue here. By solely relying on technology, we are divesting the end user of security responsibility for their own system. This puts security personnel between the proverbial “rock and a hard place,” with bad guys on one side trying to infect the systems, and on the other side, end users participating in risky behaviour. So by shifting some of the responsibility back onto the user, we create a better defense-in-depth posture.

Why We Are Addicted To Email

If you have ever wondered why most people check their email constantly, it is because email has been shown to elicit a dopamine response in the brain. Dopamine has some positive attributes such as driving motivation, curiosity and ambition. However because it can trigger the pleasure centers of the brain it can also drive addictive behaviour. This is especially true when the behaviour results in unpredictable rewards. Email very much provides unpredictable rewards, as you never know when messages will arrive or what kind of content they will contain. So from a medical perspective, our brain chemistry interacts with email in a similar fashion to gambling.

Note that this creates a never ending cycle which is reinforced by brain chemistry. The anticipation of checking email triggers a shot of dopamine. When we view the emails that have been received, our opioid system triggers a pleasure response. Since we like to be happy, a self-supporting system gets created where we constantly feel the need to check mail and reap the rewards it provides. This is why it is so easy to fall into the habit of constantly checking email, and so hard to reverse that habit.

Given what is going on in the brain, it should be obvious as to why simply educating our users is not going to fix the problem. While they may know clicking that link is a bad idea, their brain chemistry is driving them to do so anyway. This means to fix root cause, we need to modify that chemical cycle into a more positive behaviour.

Why Most Corporate Security Training Programs Fail

When I’ve made the above argument in the past, many security folks are quick to interject that they are already addressing the behavioural problem via security awareness training programs. The argument is the users have received training and therefore they should know better. However there are a couple of problems with this argument. To start, awareness training typically takes place annually. That leaves 364.25 days for the user to forget and ignore their education. It is well known that to be good at anything you need to train frequently. So we need to be testing users far more frequently than just once a year. Also, “education” is just a small portion of the behaviour problem. As discussed above, the larger hurdle is changing people’s chemically driven habits.

Properly Educating Your Users

I’ve seen a lot of security awareness programs that tell users “don’t open suspicious emails,” but never qualify what actually makes an email “suspicious.” You should develop a training program that is specific to the email system you are using. If your mail system presents internal and external “from” addresses differently, show examples of both and explain how the user can tell the difference. If your mail system uses DomainKeys Identified Mail (DKIM) and Sender Policy Framework (SPF) to validate email sources, show examples of verified and unverified senders.

Next, train your users to check the real URL they will be brought to if they click a link. Most email clients will display this URL when you hover your mouse over the link. For platforms that do not support mice, you can usually do a long press to see the URL; this is supported on both Android and Apple iOS. You should also train your users on how to read a fully qualified domain name. That way they are not fooled when the URL tries to send them to “www.google.com.evilbadguy.ru”.
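
For the technically inclined, the same right-to-left reading can be demonstrated with a few lines of Python. This naive sketch treats the last two labels as the registrable domain, which ignores multi-part public suffixes such as co.uk, but it makes the point:

```python
from urllib.parse import urlparse

def real_destination(url):
    """Show which domain a link actually points to, read from the right."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return host, ".".join(labels[-2:])  # naive registrable-domain guess

print(real_destination("http://www.google.com.evilbadguy.ru/login"))
# ('www.google.com.evilbadguy.ru', 'evilbadguy.ru')
```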

While the above info should be in your awareness training material, it should also be readily accessible on your document sharing system. When new attacks are detected, leverage them as training opportunities. Refer back to the posted instructions and use them to identify why the email looks suspicious. This gets your users used to referencing the document and prompts you to modify it when needed to incorporate new attack or detection techniques.

Identify Trusted Communication Channels

Both a strength and a weakness of email is that it accepts messages from outside of your organization. Make sure you convey to your users that they cannot blindly trust all of the messages they receive via email. A healthy dose of skepticism is extremely valuable. Consider implementing an internal messaging system only accessible to employees, such as Slack or Mattermost, and make clear that this system is more trustworthy than email. You can use this alternate system for critical messaging, or to warn users when an email is expected that may initially seem suspicious but is actually safe to open.

Test And Train Your Users

As I mentioned earlier, if you want your users to be effective at preventing malware outbreaks, you need to test them regularly. At a minimum, consider doing this on a quarterly basis. You can do the testing yourself or contract an outside third party. Sometimes it is helpful to do both, as this ensures no single technique or messaging is used. You want your users responding to the simulated attack the same way they would to a real attack. You don’t want them treating it differently because they realize it is only a test.

You also want to collect metrics on the testing. Were all of the emails received? How many were opened? How many people clicked a link? How many ran an untrusted application? This data will be extremely valuable in identifying whether your users’ performance is improving. It will also help identify whether you have any consistently problematic users.

Hack Your User’s Brains

Earlier I mentioned that the dopamine to opioid cycle created by email makes it difficult to change user habits. Now that we know that system is in place, we can take steps to modify it. I’ve had extremely good luck instituting reward programs for reporting malware or phishing attacks. If a user forwards an email to the Help Desk or Security team, and that email contains a malware or phishing attack that is attempting to target more than one person in the organization, the user gets a reward. While you can use a fixed reward system, I prefer one that varies with the severity of the attack. The reward can be swag, a gift card, public kudos, or some combination of these or other rewards. The catch is the reward only goes to the first user that reports the attack. So make sure you keep track of date/time stamps when users start reporting in.

Note what we have done here. Instead of trying to deny the dopamine-to-opioid cycle, we have simply replaced the trigger. We’ve provided a greater challenge, which also results in a greater reward, for being the first to identify a message as malicious. I’ve seen huge success in implementing this program. It is not uncommon to see the number of people who get fooled by phishing attacks drop from 40% to less than 10%.

Improve Your Incident Response

Once you have implemented new training, begun testing your users and taken steps to modify their behaviour, it is time to focus on your incident response plan. Have a defined procedure for handling these events. Whenever possible, try to leverage the early heads up to protect other users. For example, I’ve seen sites that use G Suite for Business write a quick API script to address malware and phishing attacks. The script takes as input some unique characteristic of the malicious email (subject line, sender, attachment name, etc.). The script then searches all user inboxes for the message. When a copy of the message is found, it is moved to the user’s spam folder or trash. This prevents users who might be fooled by the message from ever seeing it.
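
As a rough illustration of that idea, here is a minimal sketch against the Gmail API using google-api-python-client. It assumes you already hold an authorized credentials object (for a domain-wide sweep, a delegated service account impersonating each user in turn); the search query is hypothetical.

```python
from googleapiclient.discovery import build

def quarantine_matching_mail(creds, query):
    """Find messages matching a known-bad characteristic (subject, sender,
    attachment name, etc.) and move them out of the inbox into Spam."""
    service = build("gmail", "v1", credentials=creds)
    resp = service.users().messages().list(userId="me", q=query).execute()
    for msg in resp.get("messages", []):
        service.users().messages().modify(
            userId="me",
            id=msg["id"],
            body={"addLabelIds": ["SPAM"], "removeLabelIds": ["INBOX"]},
        ).execute()

# Example (hypothetical query); run once per mailbox you are sweeping:
# quarantine_matching_mail(creds, 'subject:"Overdue invoice" has:attachment')
```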

Finally, once you implement a reward program and start testing users, it is not uncommon to see the false positive reporting rate go up. As users start to view all messages with a more suspicious eye, they are going to report messages that turn out to be legitimate. First and foremost, do not dissuade this behaviour! Consider how long it will take your team to clean up after just one nasty outbreak, compared to the amount of time it will take to respond to a few dozen, or even a few hundred, false positives. Clearly you come out ahead dealing with the false positives. Simply thank the user for the report, explain how you can tell the message is legitimate if it is obvious, and encourage them to continue reaching out in the future if they receive any messages they are unsure of.

What To Do When You Get Compromised

Finally, let’s say you follow all of the above advice but malware still gets through and impacts one or more internal systems. How to respond will depend on the type of damage done by the malware attack. For example if it is a ransomware attack, there may be documented processes to help you recover. For most other malware variants, usually the operating system vendor or third parties will make clean up tools available. What steps to take should be clearly identified in your incident response plan.

Another option is to reduce your reliance on needing to remove the malware. Backups are a conventional method of restoring a system to its last functional state. You can also implement a strategy that reduces your reliance on locally maintained software. For example, by leveraging G Suite or Microsoft Office Online, your local system only needs a Web browser. All documents and applications are stored and executed online. This means a quick system swap can have an infected user back to fully functional in a matter of minutes.