Machine Learning Bro IDS
Machine Learning,
Bro and You!
Pandas DataFrame with all the right types and timestamp as index
What’s the intended audience?
• People who like Python
• Interested in Pandas, scikit-learn, Spark, Parquet
• Hate seeing examples on Iris data or TF-IDF
• Frustrated when trying to use your own data
• Want easy examples using Bro!
Are you going to show super scalable blah?
• Presentation will talk about Pandas, Scikit-Learn
• We also have classes/notebooks on:
• Kafka
• Parquet
• Spark
• We’ll show some of this stuff…
● Big Picture
● Software Bridges (BAT)
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
What is BAT?
BAT = Bro Analysis Tools
A simple-to-use Python module that makes getting Bro data into popular data analysis and ML packages super easy!
$ pip install bat
https://github.com/Kitware/bat
Who’s Kitware?
● ~130 people, offices around the world
● Developing and supporting open
source software for 25 years
● New information security program
● Summer internships available :)
"You guys haven't seen my rabbit, have you?"
Talk Outline
● Big Picture
● Software Bridges
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Hello World
Step 1: $ pip install bat
Step 2: Write a few lines of code
Step 3: There is no step 3...

    from pprint import pprint
    from bat import bro_log_reader

    # Run the bro reader on a given log file
    reader = bro_log_reader.BroLogReader('dhcp.log')
    for row in reader.readrows():
        pprint(row)
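What does a Bro-to-Python bridge have to do under the hood? The sketch below is not BAT's implementation; it is a minimal, standard-library-only illustration of the idea, assuming the usual Bro ASCII log layout (tab-separated values with `#fields` and `#types` header lines, `-` for unset fields). The names `SAMPLE_LOG`, `CONVERTERS`, and `read_rows` are hypothetical, and the sample values are made up.

```python
import io
from datetime import datetime, timezone

# Illustrative, trimmed-down sample in Bro's tab-separated log layout (made-up values)
SAMPLE_LOG = (
    "#separator \\x09\n"
    "#set_separator\t,\n"
    "#empty_field\t(empty)\n"
    "#unset_field\t-\n"
    "#path\tdhcp\n"
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tlease_time\n"
    "#types\ttime\tstring\taddr\tport\tinterval\n"
    "1500000000.123456\tCabc123\t192.168.1.10\t68\t3600.0\n"
    "1500000060.000000\tCdef456\t192.168.1.11\t68\t-\n"
    "#close\t2017-07-14-02-41-00\n"
)


def to_time(text):
    """Convert an epoch-seconds string into a timezone-aware datetime."""
    return datetime.fromtimestamp(float(text), tz=timezone.utc)


# Map Bro type names to Python converters; unlisted types stay as strings
CONVERTERS = {
    "time": to_time,
    "interval": float,
    "double": float,
    "count": int,
    "int": int,
    "port": int,
    "bool": lambda text: text == "T",
}


def read_rows(handle, unset_field="-"):
    """Yield one dict per log row, with values converted per the #types header."""
    fields, types = [], []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line.startswith("#types"):
            types = line.split("\t")[1:]
        elif line.startswith("#") or not line:
            continue  # other header/footer lines carry no row data
        else:
            row = {}
            for name, type_name, value in zip(fields, types, line.split("\t")):
                if value == unset_field:
                    row[name] = None
                else:
                    row[name] = CONVERTERS.get(type_name, str)(value)
            yield row


rows = list(read_rows(io.StringIO(SAMPLE_LOG)))
for row in rows:
    print(row)
```

Typed rows like these are what make the next bridge (Python to Pandas) painless: timestamps are already datetimes and ports are already integers.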
● Big Picture
● Software Bridges
○ Bro to Python
○ Python to Pandas
○ Pandas to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Anomaly Detection
Popular Mental Images
Pipeline: DNS/HTTP -> I-Forests -> Anomalous
Challenges:
● Streaming Data
● Data Volume
● Categorical and Numerical Types
● Efficient DataFrame/Matrix conversions
Output:
● 1-5% of data
● Uncommon (by definition)
● Good Base Camp
* http://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
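The "Categorical and Numerical Types" challenge comes down to this: ML algorithms want a numeric matrix, but log fields like query type or protocol are strings. One common answer is one-hot encoding. The sketch below shows the idea in plain Python; it is an illustration, not BAT's conversion code. The function `to_matrix` is hypothetical, `query_length` and `answer_count` are assumed derived features, and the row values are made up. In a Pandas workflow, `pandas.get_dummies` does the same job.

```python
def to_matrix(rows, numeric_fields, categorical_fields):
    """Return (column_names, matrix), with one 0/1 column per categorical value."""
    # Collect the sorted set of values seen for each categorical field
    categories = {
        field: sorted({row[field] for row in rows}) for field in categorical_fields
    }

    # Numeric columns first, then one column per (field, value) pair
    columns = list(numeric_fields)
    for field in categorical_fields:
        columns += [f"{field}={value}" for value in categories[field]]

    matrix = []
    for row in rows:
        vector = [float(row[field]) for field in numeric_fields]
        for field in categorical_fields:
            vector += [
                1.0 if row[field] == value else 0.0 for value in categories[field]
            ]
        matrix.append(vector)
    return columns, matrix


# Hypothetical DNS-like rows (made-up values)
rows = [
    {"query_length": 14, "answer_count": 1, "qtype_name": "A", "proto": "udp"},
    {"query_length": 52, "answer_count": 0, "qtype_name": "TXT", "proto": "udp"},
    {"query_length": 11, "answer_count": 2, "qtype_name": "AAAA", "proto": "tcp"},
]

columns, matrix = to_matrix(
    rows,
    numeric_fields=["query_length", "answer_count"],
    categorical_fields=["qtype_name", "proto"],
)
print(columns)
for vector in matrix:
    print(vector)
```

Note that the category set is learned from the data it sees; for streaming data the mapping has to be fixed up front or updated as new values arrive, which is part of why streaming is on the challenges list.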
Isolation Forests: Anomaly Detection
https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
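Why do Isolation Forests work for anomaly detection? Anomalies are "few and different", so random splits isolate them in fewer steps than normal points. The from-scratch sketch below illustrates that idea using only the standard library; it is a simplified teaching version, not the notebook's code. In practice scikit-learn's `sklearn.ensemble.IsolationForest` is the usual choice. All function names here are hypothetical and the data is synthetic.

```python
import math
import random


def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary search tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    if low == high:  # simplification: stop if the chosen feature cannot split
        return {"size": len(points)}
    split = rng.uniform(low, high)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(point, node, depth=0):
    """Depth at which the point lands, plus an estimate for the unsplit leaf."""
    if "split" not in node:
        return depth + average_path_length(node["size"])
    side = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """Score each point in (0, 1]; higher means easier to isolate."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = average_path_length(sample_size)
    scores = []
    for point in points:
        mean_depth = sum(path_length(point, tree) for tree in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Synthetic data: 100 "normal" points around the origin plus one obvious outlier
data_rng = random.Random(42)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
points.append((10.0, 10.0))

scores = anomaly_scores(points)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print("most anomalous point:", points[most_anomalous])
```

The scores are a ranking, not a verdict. Taking the top few percent gives the "anomalous" set, which still needs organizing before a human looks at it.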
Anomalous to Interesting
Organization + User Feedback
Example: 10k rows clustered and organized for display to the user *
Pipeline: Anomalous -> Organization and Clustering -> Display and Feedback* -> Interesting
Challenges:
● Streaming Data
● Organization and Clustering
● Engaging the Human
● User Interface and Feedback*
Output:
● Fraction of the 1%-5%
● Clustered/organized
● Ready for Feedback*
* Feedback will be used in the next phase of the pipeline
* http://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
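How might the "Organization and Clustering" step look? The slides do not name a clustering algorithm, so the sketch below uses plain k-means purely as an assumption, written with the standard library only. The function names and the toy feature vectors are hypothetical. The point is the shape of the output: anomalous rows grouped so a human reviews a handful of clusters instead of thousands of rows.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Toy feature vectors standing in for anomalous rows (made-up values)
anomalous = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]

labels, centers = kmeans(anomalous, k=2)

# Organize rows by cluster, ready to display to the user
clusters = defaultdict(list)
for point, label in zip(anomalous, labels):
    clusters[label].append(point)

for label in sorted(clusters):
    print(f"cluster {label}: {clusters[label]}")
```

Cluster numbering depends on the random initialization, so anything built on top (display, feedback) should key on cluster membership rather than on the label value.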
Demo: Anomaly Detection
https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Scikit.ipynb
https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
Demo: Bro to Kafka to Spark
https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Kafka_to_Spark.ipynb
Demo: Bro to Parquet to Spark
https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet_to_Spark.ipynb
Questions/Comments?