
Detecting Financial Fraud at Scale with Decision Trees and MLflow on Databricks
A Databricks guide, including code samples and notebooks.
Introduction
Detecting fraudulent patterns at scale is a challenge, no
matter the use case. The massive amounts of data to
sift through, the complexity of the constantly evolving
techniques, and the very small number of actual examples
of fraudulent behavior are comparable to finding a needle
in a haystack while not knowing what the needle looks
like. In the world of finance, the added concerns with security and the importance of explaining how fraudulent behavior was identified further increase the complexity of the task.
To build these detection patterns, a team of domain experts
often comes up with a set of rules that define fraudulent
behavior. A typical workflow may include a subject matter
expert in the financial fraud detection space putting together
a set of requirements for a particular behavior. A data scientist
may then take a subsample of the available data and build a
model using these requirements and possibly some known
fraud cases. To put the pattern in production, a data engineer
may convert the resulting model to a set of rules with
thresholds, often implemented using SQL.

This approach allows the financial institution to present a clear set of characteristics that led to the identification of fraud, in compliance with the General Data Protection Regulation (GDPR). However, this approach also poses numerous difficulties. A detection pattern implemented as a hardcoded set of rules is very brittle, and any change to the pattern takes a long time to roll out. This, in turn, makes it difficult to keep up with and adapt to shifts in fraudulent behavior in the current marketplace.

Additionally, the systems in the workflow described above are often siloed, with the domain experts, data scientists, and data engineers all compartmentalized. The data engineer is responsible for maintaining massive amounts of data and translating the work of the domain experts and data scientists into production-level code. Due to the lack of a common platform, the domain experts and data scientists have to rely on sampled-down data that fits on a single machine for analysis. This leads to difficulty in communication and ultimately a lack of collaboration.

In this eBook, we will showcase how to convert several such rule-based detection use cases to machine learning use cases on the Databricks platform, unifying the key players in fraud detection: domain experts, data scientists, and data engineers. We will learn how to create a fraud-detection data pipeline and visualize the data leveraging a framework for building modular features from large data sets. We will also learn how to detect fraud using decision trees and Apache Spark MLlib. We will then use MLflow to iterate and refine the model to improve its accuracy.

SOLVING WITH ML
There is a certain degree of reluctance with regard to machine learning models in the financial world, as they are believed to offer a "black box" solution with no way of justifying the identified fraudulent cases. GDPR requirements, as well as financial regulations, make it seemingly impossible to leverage the power of machine learning. However, several successful use cases have shown that applying machine learning to detect fraud at scale can solve a host of the issues mentioned above.

Training a supervised machine learning model to detect financial fraud is very difficult due to the low number of actual confirmed examples of fraudulent behavior. However, the presence of a known set of rules that identify a particular type of fraud can help create a set of synthetic labels and an initial set of features. The output of the detection pattern that has been developed by the domain experts in the field has likely gone through the appropriate approval process to be put in production. It produces the expected fraudulent behavior flags and may, therefore, be used as a starting point to train a machine learning model. This simultaneously mitigates three concerns:

1. The lack of training labels,
2. The decision of what features to use, and
3. Having an appropriate benchmark for the model.

Training a machine learning model to recognize the rule-based fraudulent behavior flags offers a direct comparison with the expected output via a confusion matrix. Provided that the results closely match the rule-based detection pattern, this approach helps skeptics gain confidence in machine learning-based fraud detection. The output of this model is very easy to interpret and may serve as a baseline discussion of the expected false negatives and false positives when compared to the original detection pattern.

Furthermore, the concern with machine learning models being difficult to interpret may be further assuaged if a decision tree model is used as the initial machine learning model. Because the model is being trained to a set of rules, the decision tree is likely to outperform any other machine learning model. The additional benefit is, of course, the utmost transparency of the model, which will essentially show the decision-making process for fraud, but without human intervention and the need to hard-code any rules or thresholds. Of course, it must be understood that future iterations of the model may utilize a different algorithm altogether to achieve maximum accuracy. The transparency of the model is ultimately achieved by understanding the features that went into the algorithm. Having interpretable features will yield interpretable and defensible model results.

The biggest benefit of the machine learning approach is that after the initial modeling effort, future iterations are modular, and updating the set of labels, features, or model type is very easy and seamless, reducing the time to production. This is further facilitated on the Databricks Unified Analytics Platform, where the domain experts, data scientists, and data engineers may work off the same data set at scale and collaborate directly in the notebook environment. So let's get started!
Ingesting and Exploring the Data
We will use a synthetic dataset for this example. To load the dataset yourself, please download it to your local machine from Kaggle and then import the data via Import Data – Azure and AWS (a minimal import sketch follows the list below).

The PaySim data simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The data set provides the fields shown in the schema reviewed in the next section.

In addition to reducing operational friction, Databricks is a central location to run the latest machine learning models. Users can leverage the native Spark MLlib package or download any open source Python or R ML package. With Databricks Runtime for Machine Learning, Databricks clusters are preconfigured with XGBoost, scikit-learn, and numpy, as well as popular deep learning frameworks such as TensorFlow, Keras, Horovod, and their dependencies.

In this eBook, we will explore how to:

• Import our sample data source to create a Databricks table
• Explore your data using Databricks Visualizations
• Execute ETL code against your data
• Execute an ML pipeline, including model tuning
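To make the import step concrete, here is a minimal sketch of loading the downloaded CSV from DBFS and registering it as the table queried in the next section (the DBFS path and file name below are placeholders, not the actual upload location):

# Hedged sketch: load the downloaded PaySim CSV from DBFS and register it as a table.
# The path below is a placeholder -- substitute the location you uploaded the file to.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/FileStore/tables/paysim_transactions.csv"))

# Save as a table so that it can be queried with Spark SQL in the next section
raw.write.mode("overwrite").saveAsTable("sim_fin_fraud_detection")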

EXPLORING THE DATA
Creating the DataFrames – Now that we have uploaded the data to the Databricks File System (DBFS), we can quickly and easily create DataFrames using Spark SQL.

# Create df DataFrame which contains our simulated financial fraud detection dataset
df = spark.sql("select step, type, amount, nameOrig, oldbalanceOrg, newbalanceOrig, nameDest, oldbalanceDest, newbalanceDest from sim_fin_fraud_detection")

Now that we have created the DataFrame, let's take a look at the schema and the first thousand rows to review the data.

# Review the schema of your data
df.printSchema()

root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)

TYPES OF TRANSACTIONS
Let's visualize the data to understand the types of transactions the data captures and their contribution to the overall transaction volume.

To get an idea of how much money we are talking about, let's also visualize the data based on the types of transactions and on their contribution to the amount of cash transferred (i.e. sum(amount)).
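The charts themselves are produced with the display command; a minimal sketch of the aggregations that could drive them (the groupings are assumptions rather than code from the original notebook):

from pyspark.sql import functions as F

# Count of transactions per type (drives the transaction-volume chart)
display(df.groupBy("type").count())

# Total amount of cash moved per type, i.e. sum(amount)
display(df.groupBy("type").agg(F.sum("amount").alias("sum_amount")))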

RULES-BASED MODEL
We are not likely to start with a large data set of known fraud cases to train our model. In most practical applications, fraud detection patterns are identified by a set of rules established by the domain experts. Here, we create a column called label based on these rules.

from pyspark.sql import functions as F

# Rules to identify known fraud cases
df = df.withColumn("label",
    F.when(
        ((df.oldbalanceOrg <= 56900) & (df.type == "TRANSFER") & (df.newbalanceDest <= 105)) |
        ((df.oldbalanceOrg > 56900) & (df.newbalanceOrig <= 12)) |
        ((df.oldbalanceOrg > 56900) & (df.newbalanceOrig > 12) & (df.amount > 1160000)),
        1
    ).otherwise(0))

After this ETL process is completed, you can use the display command again to review the flagged data.

# View a bar graph of our data
display(df)

VISUALIZING DATA FLAGGED BY RULES
These rules often flag quite a large number of fraudulent cases. Let's visualize the number of flagged transactions. We can see that the rules flag about 4% of the cases and 11% of the total dollar amount as fraudulent.
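The 4% and 11% figures can be reproduced with a couple of aggregations; the following is a sketch (not from the original notebook) of one way to compute them:

from pyspark.sql import functions as F

# Hedged sketch: share of transactions and of total amount flagged as fraudulent by the rules
totals = df.agg(F.count("*").alias("n"), F.sum("amount").alias("amt")).collect()[0]
frauds = df.filter(df.label == 1).agg(F.count("*").alias("n"), F.sum("amount").alias("amt")).collect()[0]

print("Flagged share of cases: %.1f%%" % (100.0 * frauds["n"] / totals["n"]))
print("Flagged share of amount: %.1f%%" % (100.0 * frauds["amt"] / totals["amt"]))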

Selecting the Appropriate Machine Learning Models
In many cases, a black box approach to fraud detection cannot be used. First, the domain experts need to be able to understand why a transaction was identified as fraudulent. Then, if action is to be taken, the evidence has to be presented in court. The decision tree is an easily interpretable model and is a great starting point for this use case. Read the blog "The wise old tree" on decision trees to learn more.

CREATING THE TRAINING SET
To build and validate our ML model, we will do an 80/20 split using .randomSplit. This will set aside a randomly chosen 80% of the data for training and the remaining 20% to validate the results.

# Split our dataset between training and test datasets
(train, test) = df.randomSplit([0.8, 0.2], seed=12345)

CREATING THE ML MODEL PIPELINE
To prepare the data for the model, we must first convert categorical variables to numeric using .StringIndexer. We then must assemble all of the features we would like for the model to use. We create a pipeline to contain these feature preparation steps in addition to the decision tree model so that we may repeat these steps on different data sets. Note that we fit the pipeline to our training data first and will then use it to transform our test data in a later step. The feature list in the pipeline references orgDiff and destDiff columns that are not created in this excerpt; a sketch of how they might be derived appears next.
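Assuming orgDiff and destDiff are simple balance-difference features, a minimal sketch of how they might be added (and, since they must exist in both splits, applied to df before the randomSplit above):

from pyspark.sql import functions as F

# Hedged sketch: assumed balance-difference features backing the orgDiff/destDiff columns.
# In practice these would be added before the 80/20 split so both train and test contain them.
df = (df.withColumn("orgDiff", F.col("newbalanceOrig") - F.col("oldbalanceOrg"))
        .withColumn("destDiff", F.col("newbalanceDest") - F.col("oldbalanceDest")))
(train, test) = df.randomSplit([0.8, 0.2], seed=12345)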

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Encodes a string column of labels to a column of label indices
indexer = StringIndexer(inputCol = "type", outputCol = "typeIndexed")

# VectorAssembler is a transformer that combines a given list of columns into a single vector column
va = VectorAssembler(inputCols = ["typeIndexed", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "orgDiff", "destDiff"], outputCol = "features")

# Using the DecisionTree classifier model
dt = DecisionTreeClassifier(labelCol = "label", featuresCol = "features", seed = 54321, maxDepth = 5)

# Create our pipeline stages
pipeline = Pipeline(stages=[indexer, va, dt])

# Fit the pipeline to view the Decision Tree model (prior to CrossValidator)
dt_model = pipeline.fit(train)

Visual representation of the Decision Tree model


VISUALIZING THE MODEL
Calling display() on the last stage of the pipeline, which is the decision
tree model, allows us to view the initial fitted model with the chosen
decisions at each node. This helps to understand how the algorithm
arrived at the resulting predictions.

display(dt_model.stages[-1])
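Outside of the notebook visualization, the fitted tree can also be inspected as plain text; a small sketch using the model's debug string:

# Print the learned split rules of the fitted decision tree as plain text
print(dt_model.stages[-1].toDebugString)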

MODEL TUNING
To ensure we have the best-fitting tree model, we will cross-validate the model with several parameter variations. Given that our data consists of 96% negative and 4% positive cases, we will use the Precision-Recall (PR) evaluation metric to account for the unbalanced distribution. The code below also relies on two evaluators, evaluatorPR and evaluatorAUC, sketched next.
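The evaluatorPR and evaluatorAUC objects referenced below are not defined in this excerpt; a sketch of how they might be constructed with BinaryClassificationEvaluator (the column choices are assumptions based on the pipeline above):

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hedged sketch: PR and ROC evaluators assumed by the cross-validation code that follows
evaluatorPR = BinaryClassificationEvaluator(labelCol = "label", rawPredictionCol = "prediction", metricName = "areaUnderPR")
evaluatorAUC = BinaryClassificationEvaluator(labelCol = "label", rawPredictionCol = "prediction", metricName = "areaUnderROC")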

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Build the grid of different parameters
paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [5, 10, 15]) \
    .addGrid(dt.maxBins, [10, 20, 30]) \
    .build()

# Build out the cross validation
crossval = CrossValidator(estimator = dt,
                          estimatorParamMaps = paramGrid,
                          evaluator = evaluatorPR,
                          numFolds = 3)
# Build the CV pipeline
pipelineCV = Pipeline(stages=[indexer, va, crossval])

# Train the model using the pipeline, parameter grid, and preceding BinaryClassificationEvaluator
cvModel_u = pipelineCV.fit(train)

MODEL PERFORMANCE
We evaluate the model by comparing the Precision-Recall (PR) and Area under the ROC curve (AUC) metrics for the training and test sets. Both PR and AUC appear to be very high.

# Build the best model (training and test datasets)
train_pred = cvModel_u.transform(train)
test_pred = cvModel_u.transform(test)

# Evaluate the model on training datasets
pr_train = evaluatorPR.evaluate(train_pred)
auc_train = evaluatorAUC.evaluate(train_pred)

# Evaluate the model on test datasets
pr_test = evaluatorPR.evaluate(test_pred)
auc_test = evaluatorAUC.evaluate(test_pred)

# Print out the PR and AUC values
print("PR train:", pr_train)
print("AUC train:", auc_train)
print("PR test:", pr_test)
print("AUC test:", auc_test)

---
# Output:
# PR train: 0.9537894984523128
# AUC train: 0.998647996459481
# PR test: 0.9539170535377599
# AUC test: 0.9984378183482442
---

To see how the model misclassified the results, let's use matplotlib and pandas to visualize our confusion matrix.
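The plotting code itself is not included in this excerpt; the following is a minimal sketch (not taken from the original notebook) of how the confusion matrix could be assembled with pandas and rendered with matplotlib:

import matplotlib.pyplot as plt
import pandas as pd

# Hedged sketch: cross-tabulate true labels vs. predictions on the test set
cm = (test_pred.groupBy("label", "prediction").count()
               .toPandas()
               .pivot(index="label", columns="prediction", values="count")
               .fillna(0))

fig, ax = plt.subplots()
ax.matshow(cm.values, cmap="Blues")   # heatmap of the confusion matrix
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
for (i, j), v in pd.DataFrame(cm.values).stack().items():
    ax.text(j, i, int(v), ha="center", va="center")
display(fig)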

BALANCING THE CLASSES
We see that the model is identifying 2421 more cases than the original rules identified. This is not as alarming, as detecting more potential fraudulent cases could be a good thing. However, there are 58 cases that were not detected by the algorithm but were originally identified. We are going to attempt to improve our prediction further by balancing our classes using undersampling. That is, we will keep all the fraud cases and then downsample the non-fraud cases to match that number to get a balanced data set. When we visualize our new data set, we see that the yes and no cases are 50/50.

# Reset the DataFrames for no fraud (`dfn`) and fraud (`dfy`)
dfn = train.filter(train.label == 0)
dfy = train.filter(train.label == 1)

# Calculate summary metrics
N = train.count()
y = dfy.count()
p = y/N

# Create a more balanced training dataset
train_b = dfn.sample(False, p, seed = 92285).union(dfy)

# Print out metrics
print("Total count: %s, Fraud cases count: %s, Proportion of fraud cases: %s" % (N, y, p))
print("Balanced training dataset count: %s" % train_b.count())

---
# Output:
# Total count: 5090394, Fraud cases count: 204865, Proportion of fraud cases: 0.040245411258932016
# Balanced training dataset count: 401898
---

# Display our more balanced training dataset
display(train_b.groupBy("label").count())

UPDATING THE PIPELINE
Now let's update the ML pipeline and create a new cross validator. Because we are using ML pipelines, we only need to update it with the new dataset and we can quickly repeat the same pipeline steps.

# Re-run the same ML pipeline (including the parameter grid)
crossval_b = CrossValidator(estimator = dt,
                            estimatorParamMaps = paramGrid,
                            evaluator = evaluatorAUC,
                            numFolds = 3)
pipelineCV_b = Pipeline(stages=[indexer, va, crossval_b])

# Train the model using the pipeline, parameter grid, and BinaryClassificationEvaluator using the `train_b` dataset
cvModel_b = pipelineCV_b.fit(train_b)

# Build the best model (balanced training and full test datasets)
train_pred_b = cvModel_b.transform(train_b)
test_pred_b = cvModel_b.transform(test)

# Evaluate the model on the balanced training datasets
pr_train_b = evaluatorPR.evaluate(train_pred_b)
auc_train_b = evaluatorAUC.evaluate(train_pred_b)

# Evaluate the model on the full test datasets
pr_test_b = evaluatorPR.evaluate(test_pred_b)
auc_test_b = evaluatorAUC.evaluate(test_pred_b)

# Print out the PR and AUC values
print("PR train:", pr_train_b)
print("AUC train:", auc_train_b)
print("PR test:", pr_test_b)
print("AUC test:", auc_test_b)

---
# Output:
# PR train: 0.999629161563572
# AUC train: 0.9998071389056655
# PR test: 0.9904709171789063
# AUC test: 0.9997903902204509
---

REVIEW THE RESULTS
Now let's look at the results of our new confusion matrix. The model misidentified only one fraudulent case. Balancing the classes seems to have improved the model.

MODEL FEEDBACK AND USING MLFLOW
Once a model is chosen for production, we want to continuously collect
feedback to ensure that the model is still identifying the behavior of interest.
Since we are starting with a rule-based label, we want to supply future
models with verified true labels based on human feedback. This stage
is crucial for maintaining confidence and trust in the machine learning
process. Since analysts are not able to review every single case, we want to
ensure we are presenting them with carefully chosen cases to validate the
model output. For example, predictions where the model has low certainty
are good candidates for analysts to review (a sketch of how such cases might be selected follows). The addition of this type of
feedback will ensure the models will continue to improve and evolve with
the changing landscape.
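As a sketch of how such low-certainty predictions might be pulled out for analyst review (the 0.4–0.6 probability band and the helper column are illustrative assumptions, not part of the original notebook):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Hedged sketch: extract the fraud-class probability and keep predictions near the decision boundary
p_fraud = F.udf(lambda v: float(v[1]), DoubleType())
scored = test_pred_b.withColumn("p_fraud", p_fraud("probability"))
review_queue = scored.filter((F.col("p_fraud") > 0.4) & (F.col("p_fraud") < 0.6))
display(review_queue.select("nameOrig", "type", "amount", "p_fraud", "prediction"))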

MLflow helps us throughout this cycle as we train different model versions.


We can keep track of our experiments, comparing the results of different
model configurations and parameters. For example, here we can compare
the PR and AUC of the models trained on balanced and unbalanced data
sets using the MLflow UI. Data scientists can use MLflow to keep track of
the various model metrics and any additional visualizations and artifacts to
help make the decision of which model should be deployed in production.
The data engineers will then be able to easily retrieve the chosen model
along with the library versions used for training as a .jar file to be deployed
on new data in production. Thus, the collaboration between the domain
experts who review the model results, the data scientists who update the
models, and the data engineers who deploy the models in production, will
be strengthened throughout this iterative process.
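As an illustration, here is a minimal sketch (not the exact code behind this guide) of logging the balanced-data run to MLflow so it can be compared with the unbalanced run in the MLflow UI:

import mlflow
import mlflow.spark

# Hedged sketch: track the balanced-data run so it can be compared with the unbalanced one in the MLflow UI
with mlflow.start_run(run_name="decision_tree_balanced"):
    mlflow.log_param("training_data", "balanced")
    mlflow.log_metric("PR_train", pr_train_b)
    mlflow.log_metric("AUC_train", auc_train_b)
    mlflow.log_metric("PR_test", pr_test_b)
    mlflow.log_metric("AUC_test", auc_test_b)
    # Log the fitted Spark ML pipeline so data engineers can retrieve it for deployment
    mlflow.spark.log_model(cvModel_b, "model")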

CONCLUSION
We have reviewed an example of how to use a rule-based fraud
detection label and convert it to a machine learning model using
Databricks with MLflow. This approach allows us to build a scalable,
modular solution that will help us keep up with ever-changing
fraudulent behavior patterns. Building a machine learning model to
identify fraud creates a feedback loop that allows the model
to evolve and identify new potential fraudulent patterns. We have
seen how a decision tree model, in particular, is a great starting point
to introduce machine learning to a fraud detection program due to its
interpretability and excellent accuracy.

A major benefit of using the Databricks platform for this effort is that it
allows data scientists, engineers, and business users to work together
seamlessly throughout the process. Preparing the data, building
models, sharing the results, and putting the models into production
can now happen on the same platform, allowing for unprecedented
collaboration. This approach builds trust across the previously siloed
teams, leading to an effective and dynamic fraud detection program.

Try this notebook by signing up for a free trial in just a few minutes and
get started creating your own models.
