

Introduction:

The Loan Approval Classification Dataset is a collection of financial data used to predict whether a loan application will be approved or denied. It includes information about the applicant's demographics, employment, income, and credit history, and it is commonly used for training machine learning models to automate the loan approval process.

The objective of working with this dataset is to predict whether a loan application will be approved or denied based on various financial and demographic factors. The dataset is used to train and evaluate machine learning models that can automate the loan approval process, making it more efficient and accurate.

1. Handling Missing Values

The script first looks for missing values in the dataset using handy base R functions such as is.na() and colSums(). It calculates the total number of missing values and breaks the count down column by column. For example:
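A minimal sketch of those calls, assuming the dataset has been read into a data frame named loan_data (a hypothetical name; substitute your own object):

# Total number of missing values in the whole data frame
sum(is.na(loan_data))

# Missing values broken down column by column
colSums(is.na(loan_data))
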
Once the missing values are identified, the script offers a couple of solutions:

- It uses na.omit() to completely remove rows with missing values if necessary.
- For more refined handling, it fills missing values with something like the median of the column, as shown in the sketch after this list.

Either way, the script makes sure the missing data doesn't distort the analysis.
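Here is a sketch of both options, again assuming a loan_data data frame and imputing the person_income column (an assumed column name) with its median:

# Option 1: drop every row that contains at least one NA
loan_complete <- na.omit(loan_data)

# Option 2: keep the rows and fill a numeric column's NAs with its median
loan_data$person_income[is.na(loan_data$person_income)] <-
  median(loan_data$person_income, na.rm = TRUE)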

2. Visualizing Missing Data

To make things clearer, the script uses graphs to visualize where data is missing, which helps in understanding the problem better. It uses two tools:

1. naniar library: With gg_miss_var(), it creates a simple bar chart showing how many missing values each column has.
2. VIM library: The aggr() function makes a neat plot, coloring missing values in red and complete ones in blue, like a visual heatmap of the missing data.

Here’s an example:
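A sketch of both plots, assuming the naniar and VIM packages are installed and the data frame is named loan_data:

library(naniar)
library(VIM)

# naniar: bar chart of missing values per variable
gg_miss_var(loan_data)

# VIM: aggregation plot; complete cells in blue, missing cells in red
aggr(loan_data, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE)
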
3. Balancing the Dataset

The dataset might have an imbalance in how many rows belong to each class, for example in loan_status. If one class (like "Approved") has far more rows than another (like "Rejected"), it can skew the analysis.

The script fixes this using the ROSE library. It tries three methods:

1. Oversampling: adding more rows to the smaller class.
2. Undersampling: removing rows from the larger class.
3. Combination (both): a mix of oversampling and undersampling.

For instance:
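A sketch using ROSE's ovun.sample() function, assuming loan_status is the two-class target in a data frame named loan_data. The calls below rely on the default target proportion of roughly 50/50; the N or p arguments can be supplied to control the resulting size, and the seed is only for reproducibility:

library(ROSE)

# Oversampling: duplicate minority-class rows until the classes are roughly even
over_bal  <- ovun.sample(loan_status ~ ., data = loan_data, method = "over",  seed = 1)$data

# Undersampling: drop rows from the majority class
under_bal <- ovun.sample(loan_status ~ ., data = loan_data, method = "under", seed = 1)$data

# Combination: a mix of both
both_bal  <- ovun.sample(loan_status ~ ., data = loan_data, method = "both",  seed = 1)$data

table(both_bal$loan_status)   # check the new class balance
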
Now, the data is balanced, making the analysis fairer.

4. Removing Duplicate Data

Duplicates in a dataset can distort results. The script identifies repeated rows using the duplicated() function and then removes them. It can also find duplicates based on specific columns, like person_age, person_gender and person_education. For example:
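A sketch of both checks, assuming a loan_data data frame with those column names:

# Flag and drop rows that are exact duplicates of an earlier row
dup_rows  <- duplicated(loan_data)
sum(dup_rows)                       # how many exact duplicates exist
loan_data <- loan_data[!dup_rows, ]

# Duplicates judged only on selected columns
dup_cols  <- duplicated(loan_data[, c("person_age", "person_gender", "person_education")])
loan_data <- loan_data[!dup_cols, ]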

This ensures that the data is clean and has no redundancy.

5. Filtering the Data

The script demonstrates several ways to filter rows of data based on conditions. For example:

- To keep only male applicants.
- To filter for people aged between 20 and 30 (both filters are sketched below).
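A minimal sketch of both filters in base R, assuming a loan_data data frame and that gender is stored as lowercase "male" (an assumption; check the actual labels in your data):

# Keep only male applicants
males <- loan_data[loan_data$person_gender == "male", ]

# Keep applicants aged between 20 and 30 (inclusive)
aged_20_30 <- subset(loan_data, person_age >= 20 & person_age <= 30)
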
This is useful when you want to focus on specific segments of the data.

6. Converting Between Categorical and Numeric Data

Sometimes, categorical data like "Male" and "Female" needs to be turned into numbers (0 and 1) for analysis. The script also turns education levels (like "Bachelor" and "Master") into numbers:
This makes the data easier to work with for statistical models.

7. Normalization

Normalization means rescaling numeric values so they fit within a specific range, such as 0 to 1. The script normalizes person_income using min-max scaling:
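A sketch of min-max scaling for person_income, assuming a loan_data data frame:

# Min-max scaling: map person_income onto the 0-1 range
inc <- loan_data$person_income
loan_data$income_scaled <- (inc - min(inc, na.rm = TRUE)) /
  (max(inc, na.rm = TRUE) - min(inc, na.rm = TRUE))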

This is useful when different attributes have very different scales.

8. Handling Outliers and Invalid Data

Outliers, values that are far higher or lower than the rest, can distort the analysis. The script identifies outliers using the Interquartile Range (IQR) method:

- First, it calculates the lower and upper bounds.
- Then, it replaces outliers with the median income.

For invalid data (like negative ages), it simply identifies and removes or fixes it. Both steps are sketched below.
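A sketch of the IQR rule plus the invalid-age check, assuming a loan_data data frame and the usual 1.5 × IQR convention:

# IQR bounds for person_income
q1  <- quantile(loan_data$person_income, 0.25, na.rm = TRUE)
q3  <- quantile(loan_data$person_income, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Replace outliers with the median income
med_income <- median(loan_data$person_income, na.rm = TRUE)
outlier <- loan_data$person_income < lower_bound | loan_data$person_income > upper_bound
loan_data$person_income[outlier] <- med_income

# Invalid data: drop rows with impossible ages
loan_data <- loan_data[loan_data$person_age > 0, ]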
