Machine Learning Bro Ids


Data Analysis, Machine Learning, Bro and You!

Together again like never before...


Presenter
Brian Wylie
Working at Kitware Inc.
Background in Information Security and Vis
Likes open source and mixed Corgis
What’s the point of this talk?
Provide software classes and examples that make
the path from Bro Network data to the popular data
analysis and machine learning libraries easy.

When you say easy, what do you mean?


One line of code:
Bro Log → Pandas DataFrame

Pandas DataFrame with all the right types and timestamp as index
What’s the intended audience?
• People who like Python
• Interested in Pandas, scikit-learn, Spark, Parquet
• Hate seeing examples on Iris data or TF-IDF
• Frustrated when trying to use your own data
• Want easy examples using Bro!
Are you going to show super scalable blah?
• Presentation will talk about Pandas, Scikit-Learn
• We also have classes/notebooks on:
• Kafka
• Parquet
• Spark
• We’ll show some of this stuff…

Please see tomorrow’s great talk :)


3:30 p.m. Spark and Bro: When Bro-Cut Won’t Cut It
Eric Dull, Joseph Mosby, & Brian Sacash; Deloitte & Touche
Talk Outline
● Big Picture
● Software Bridges
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests

"What is the best way to do data science on Bro Network data?"
"I’m not sure… Ahhh!!!"
Security Data → Data Analysis and Machine Learning
Data flow diagram of how Pandas and Scikit-Learn are used.
● DataFrame = Pandas
● Numpy array = Scikit-Learn
[Diagram]
Sources: JSON, Agents, Packets, Logs, Bro IDS
DataFrame: Stats, Filtering, Grouping, Vis/Plots
numpy array: Clustering, Anomaly, Stats, ML


Talk Outline
"You guys haven't seen my rabbit have you?"

● Big Picture
● Software Bridges (BAT)
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
What is BAT?
BAT = Bro Analysis Tools
A simple-to-use Python module that makes getting Bro data into popular data
analysis and ML packages super easy!

$ pip install bat
https://github.com/Kitware/bat

Who’s Kitware?
● ~130 people, offices around the world
● Developing and supporting open
source software for 25 years
● New information security program
● Summer Internships available :)
Talk Outline
"You guys haven't seen my rabbit have you?"

● Big Picture
● Software Bridges
○ Bro to Python
○ Bro to Pandas
○ Bro to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Hello World
Step 1: $ pip install bat
Step 2: Write a few lines of code
Step 3: There is no step 3...

    from pprint import pprint
    from bat import bro_log_reader

    # Run the bro reader on a given log file
    reader = bro_log_reader.BroLogReader('dhcp.log')
    for row in reader.readrows():
        pprint(row)

Output: Streaming (generator) of Python dictionaries with the proper type conversions.

<<< Output >>>

    {'assigned_ip': '192.168.84.10',
     'id.orig_h': '192.168.84.10',
     'id.orig_p': 68,
     'id.resp_h': '192.168.84.1',
     'id.resp_p': 67,
     'lease_time': datetime.timedelta(49710, 23000),
     'mac': '00:20:18:eb:ca:54',
     'trans_id': 495764278,
     'ts': datetime.datetime(2012, 7, 20, 3, 14, 12, 219654),
     'uid': 'CJsdG95nCNF1RXuN5'}
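To make "proper type conversions" concrete, here is a rough standard-library sketch of the idea: read the `#fields` and `#types` header lines of a tab-separated Bro log and convert each value as rows stream by. This is not BAT's implementation. The names `read_bro_rows` and `CONVERTERS` are made up for illustration, and the sketch assumes the default tab separator, `-` for unset fields, `(empty)` for empty fields, and UTC timestamps.

```python
import io
from datetime import datetime, timedelta, timezone

# Hypothetical converters keyed by Bro type name; any other type stays a string.
CONVERTERS = {
    "time": lambda v: datetime.fromtimestamp(float(v), tz=timezone.utc),
    "interval": lambda v: timedelta(seconds=float(v)),
    "port": int,
    "count": int,
    "int": int,
    "double": float,
    "bool": lambda v: v == "T",
}


def read_bro_rows(lines, unset="-", empty="(empty)"):
    """Yield one dict per data line of a tab-separated Bro log (a generator)."""
    fields, types = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line.startswith("#types"):
            types = line.split("\t")[1:]
        elif line.startswith("#"):
            continue  # other header/footer lines (#separator, #path, #close, ...)
        else:
            row = {}
            for name, kind, value in zip(fields, types, line.split("\t")):
                if value in (unset, empty):
                    row[name] = None
                else:
                    row[name] = CONVERTERS.get(kind, str)(value)
            yield row


# A tiny made-up log in the usual Bro layout, for demonstration only.
sample = "\n".join([
    "#separator \\x09",
    "#path\tdhcp",
    "#fields\tts\tuid\tid.orig_h\tid.orig_p\tlease_time\tmac",
    "#types\ttime\tstring\taddr\tport\tinterval\tstring",
    "1342754052.219654\tCJsdG95nCNF1RXuN5\t192.168.84.10\t68\t3600.000000\t-",
    "#close\t2012-07-20-03-14-12",
])
rows = list(read_bro_rows(io.StringIO(sample)))
```

Because the function is a generator, it can be pointed at an open file handle and consumed one row at a time, which is the same streaming shape as the reader shown above.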
What’s a Pandas?
Talk Outline
● Big Picture
● Software Bridges
○ Bro to Python
○ Bro to Pandas
○ Pandas to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Pandas DataFrames
“Pandas is a Python package providing fast, flexible, and expressive
data structures designed to make working with relational or labeled
data both easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real world data analysis in Python.”

Demo: Bro To Pandas


Scikit whatcha?
Talk Outline
● Big Picture
● Software Bridges
○ Bro to Python
○ Python to Pandas
○ Pandas to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Scikit-Learn
“Scikit-learn is a free software machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms
including support vector machines, random forests, gradient boosting, k-means and
DBSCAN, and is designed to interoperate with the Python numerical and scientific
libraries NumPy and SciPy.”

● We create numpy ndarrays with proper handling of both categorical and numeric types. Our DataFrameToMatrix class supports fit, fit_transform, and transform methods.
● Internal maps for categorical ‘one-hot’ encoding and numerical normalization mean that serialization and train/evaluate use cases are supported.
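A small sketch of why the internal maps matter: if the encoder remembers the categories and numeric ranges it saw during fit, then transform produces the same column layout later, at evaluation time. The class below is a hypothetical stand-in written with plain Python lists; it is not the DataFrameToMatrix class, and the min-max scaling is just one assumed choice of normalization.

```python
class RowsToMatrix:
    """Hypothetical encoder: min-max scaling for numbers, one-hot for strings."""

    def fit(self, rows):
        self.ranges = {}      # numeric column -> (min, max) seen during fit
        self.categories = {}  # string column -> sorted list of values seen during fit
        for col in rows[0]:
            values = [r[col] for r in rows]
            if isinstance(values[0], str):
                self.categories[col] = sorted(set(values))
            else:
                self.ranges[col] = (min(values), max(values))
        return self

    def transform(self, rows):
        matrix = []
        for r in rows:
            out = []
            # Numeric columns first, scaled with the ranges learned in fit().
            for col, (lo, hi) in self.ranges.items():
                span = (hi - lo) or 1.0  # avoid dividing by zero for constant columns
                out.append((r[col] - lo) / span)
            # Then one column per known category; unseen values encode as all zeros.
            for col, cats in self.categories.items():
                out.extend(1.0 if r[col] == c else 0.0 for c in cats)
            matrix.append(out)
        return matrix

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


# Made-up DNS-like rows for illustration.
train = [
    {"query_length": 10, "qtype_name": "A"},
    {"query_length": 30, "qtype_name": "AAAA"},
    {"query_length": 20, "qtype_name": "A"},
]
encoder = RowsToMatrix()
train_matrix = encoder.fit_transform(train)

# Later data reuses the stored maps, so the columns line up with training.
new_matrix = encoder.transform([{"query_length": 20, "qtype_name": "TXT"}])
```

Since `ranges` and `categories` are ordinary dictionaries, the fitted encoder can be pickled and reloaded, which is the serialization use case the bullet refers to.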

Demo: Bro To Scikit


Talk Outline
"One fish is red.. You don’t need machine learning for that!"

● Big Picture
● Software Bridges
○ Bro to Python
○ Python to Pandas
○ Pandas to Scikit-Learn
● Example: Anomaly Detection
○ Bro DNS and HTTP logs
○ Categorical and Numeric Data
○ Clustering
○ Isolation Forests
Anomaly Detection
Popular Mental Images

Popular Misconception: It’s going to show me ‘bad’ stuff
Anomaly Detection
Just gets you to base camp...
● Possibly Malicious: ~.01% (Recommender System)
● Interesting: ~1%, interesting traffic (Organization + User Feedback)
● Anomalous (Base Camp): ~5%, anomalous traffic (Anomaly Detection)
● Normal Network Traffic: ~95%, normal network traffic that can be filtered out early in the pipeline
● Raw Network Traffic: 100%, all traffic (unknown mix)


Normal to Anomalous
Anomaly Detection
[Diagram labels: Normal Network Traffic, Bro IDS Output, DataFrame, Matrix Conversion, I-Forests, Anomalous DNS/HTTP]

Example: 1M HTTP Logs to 10k anomalous rows *

Challenges:
● Streaming Data
● Data Volume
● Categorical and Numerical Types
● Efficient DataFrame/Matrix conversions

Anomalous Output:
● 1-5% of data
● Uncommon (by def)
● Good Base Camp

* http://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
Isolation Forests: Anomaly Detection

9 Divisions (not anomalous) 4 Divisions (anomalous)

https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
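The "divisions" idea can be sketched in a few lines of plain Python: keep splitting the data at a random value along a random axis, follow the side that contains the point of interest, and count how many splits it takes until the point is alone. This is only a toy illustration of the intuition, not the notebook's code and not a full isolation forest (no subsampling, no tree reuse, no score normalization). The function name, the synthetic data, and the trial count are all assumptions made for the example.

```python
import random


def divisions_to_isolate(point, data, rng, max_depth=50):
    """Count random axis-aligned splits until `point` is alone in its partition."""
    subset = list(data)
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        # Only dimensions where the remaining points differ can be split.
        bounds = []
        for dim in range(len(point)):
            lo = min(p[dim] for p in subset)
            hi = max(p[dim] for p in subset)
            if lo < hi:
                bounds.append((dim, lo, hi))
        if not bounds:
            break  # every remaining point is identical to `point`
        dim, lo, hi = rng.choice(bounds)
        split = rng.uniform(lo, hi)
        side = point[dim] < split
        # Keep only the points that fall on the same side as `point`.
        subset = [p for p in subset if (p[dim] < split) == side]
        depth += 1
    return depth


rng = random.Random(0)
# Synthetic data: a dense cluster around the origin plus one far-away point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)
data = cluster + [inlier, outlier]


def average_divisions(point, trials=200):
    total = sum(divisions_to_isolate(point, data, rng) for _ in range(trials))
    return total / trials


avg_inlier = average_divisions(inlier)
avg_outlier = average_divisions(outlier)
```

Averaged over many random trials, the far-away point needs fewer divisions than the point in the middle of the cluster, which is the property an isolation forest turns into an anomaly score.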
Anomalous to Interesting
Organization + User Feedback
[Diagram labels: Anomalous DNS/HTTP, Organization and Clustering, Display and Feedback*, Interesting]

Example: 10k rows clustered and organized for display to user *

Challenges:
● Streaming Data
● Organization and Clustering
● Engaging the Human
● User Interface and Feedback*

Interesting Output:
● Fraction of 1%-5%
● Clustered/organized
● Ready for Feedback*

* Feedback will be used in the next phase of the pipeline
* http://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
Demo: Anomaly Detection

https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Scikit.ipynb
https://github.com/Kitware/bat/blob/master/notebooks/Anomaly_Detection.ipynb
Demo: Bro to Kafka to Spark

https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Kafka_to_Spark.ipynb
Demo: Bro to Parquet to Spark

https://github.com/Kitware/bat/blob/master/notebooks/Bro_to_Parquet_to_Spark.ipynb
Questions/Comments?
