How to Analyze a Firewall Ruleset with Hadoop
Note: This is an old blog post and the code repository is not being actively maintained.
Ruleset Analysis is a tool for analyzing firewall log files to determine which firewall rules are in use and by what kind of traffic. The first release supports the Cisco ASA and FWSM firewalls. The analysis is built as Hadoop Streaming jobs, since the volume of logs to analyze can easily reach hundreds of gigabytes, or even terabytes for very active firewalls. To produce useful results, the logs should span at least a couple of months, preferably six or twelve. The analysis tells you exactly what traffic was allowed by each firewall rule and when that traffic occurred.
A common use case for Ruleset Analysis is to use the insight it produces to reduce the size of large firewall rulesets. Armed with knowledge of when a rule was last in use and by what traffic, it becomes easier to determine whether the rule can be removed. Rules with no hits in the analyzed time span are also likely candidates for removal. In addition, Ruleset Analysis can be used to replace a generic rule with more specific ones. Traffic counters are often used to check which rules are in use, but I explained some of their shortcomings in my previous post.
How to install requirements
For instructions on how to install the prerequisites the analysis needs (mostly Python modules), see the README on GitHub.
Sample results
Here is an example of the output for each firewall rule:
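The exact formatting of the report may differ; a mock-up consistent with the results discussed below (rule names, addresses and timestamps invented for illustration) looks like this:

```
Rule: permit tcp Internal-networks any eq 8080 -- 7 hits, 2 distinct sources
access-list inside_access_in extended permit tcp object-group Internal-networks any object-group Web
  2014-06-06 13:02:14 - 13:29:41  10.10.4.21 -> 203.0.113.50:8080  (6 hits)
  ...
```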
This says that outbound access to websites on port 8080 got seven hits during the last year, but only from two distinct sources. An internal machine initiated six of those connections to one external server on port 8080 in half an hour on June 6th. All in all, this tells us that the rule is rarely in use and may be a candidate for removal.
The second line of the output shows the access-list entry in the original Cisco syntax. Note that Ruleset Analysis supports object-groups: for each object in an object-group, the preprocessor creates a distinct rule object, effectively expanding the group into separate rules. Here, for instance, it has expanded the object-group Web into TCP port 8080 (and other ports not shown). The benefit is that Ruleset Analysis can determine which objects in an object-group are in use and which are not, so unused objects can be removed from the object-group (and therefore from the ruleset).
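As a hypothetical illustration (ACL and group names invented), a config snippet before and after preprocessing could look like this:

```
! Original config: one ACL entry referencing an object-group
object-group service Web tcp
  port-object eq www
  port-object eq 8080
access-list inside_access_in extended permit tcp any any object-group Web

! After preprocessing: one rule object per member of the object-group
access-list inside_access_in extended permit tcp any any eq www
access-list inside_access_in extended permit tcp any any eq 8080
```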
How to run the analysis on Hadoop
To run the analysis you need the firewall config, the corresponding log files and access to a Hadoop cluster.
Clone the repository from GitHub:
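Assuming the repository URL from the README (substitute the real one if it differs):

```
git clone https://github.com/arnesund/ruleset-analysis.git
cd ruleset-analysis
```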
Preprocess the config file to extract access-lists and generate ACL objects:
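The script name and flag below are placeholders; check the repository README for the exact invocation:

```
# Parse the firewall config and save the expanded ACL rule objects
./preprocess_access_lists.py -f firewall.conf
```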
Submit the job to the Hadoop cluster with the path to the firewall log files in the Hadoop filesystem HDFS (wildcards allowed):
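The repository likely wraps this step in a helper script; a bare-bones Hadoop Streaming submission, with mapper.py and reducer.py as placeholder names for the repository's mapper and reducer, would look roughly like this:

```
# Submit the streaming job; the quoted -input glob is expanded by HDFS,
# not by the local shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input '/data/firewall-logs/*.log' \
    -output output-$(date +%Y%m%d-%H%M)_RulesetAnalysis
```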
The output from Hadoop Streaming is shown on the console:
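The exact log lines vary with the Hadoop version; the tail of the console output looks something like this, with the output directory on the last line:

```
...
15/01/04 11:24:37 INFO streaming.StreamJob: Job complete: job_201501041052_0042
15/01/04 11:24:37 INFO streaming.StreamJob: Output: output-20150104-1124_RulesetAnalysis
```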
Note the name of the output directory on the last line of output, “output-20150104-1124_RulesetAnalysis” in this example. You’ll use that to fetch the results from HDFS. Insert the name of the output directory in the variable below:
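For the example run above, that would be:

```
OUTPUTDIR=output-20150104-1124_RulesetAnalysis
```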
With the job finished, the last step is to fetch the results from HDFS, run postprocessing to generate the final report, and view it:
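A sketch of those steps, with postprocess.py as a placeholder for the repository's postprocessing script:

```
# Merge the part files from HDFS into one local results file
hadoop fs -getmerge $OUTPUTDIR results.out

# Generate the final ruleset report and view it
cat results.out | ./postprocess.py > ruleset-report.txt
less ruleset-report.txt
```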
Manually test the analysis on a small log volume
For small log volumes and trial runs, the analysis can be run without a Hadoop cluster (no parallelization), like this:
Clone the repository from GitHub, if you haven't already:
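As before (repository URL assumed from the README):

```
git clone https://github.com/arnesund/ruleset-analysis.git
cd ruleset-analysis
```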
Preprocess the config file to extract access-lists and generate ACL objects:
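Same placeholder invocation as in the Hadoop walkthrough:

```
./preprocess_access_lists.py -f firewall.conf
```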
Pipe the firewall log through the Python mapper and reducer manually:
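This is the standard way to test a Hadoop Streaming job locally: a shell pipeline mimics the map-sort-reduce stages. mapper.py and reducer.py are placeholder names for the repository's scripts:

```
# Map, sort and reduce on a single machine, no Hadoop required
cat firewall.log | ./mapper.py | sort | ./reducer.py > results.out
```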
Postprocess the results to generate the final ruleset report and take a look at it:
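Again with postprocess.py as a placeholder name:

```
cat results.out | ./postprocess.py > ruleset-report.txt
less ruleset-report.txt
```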