The script first checks the dataset for missing values using the R functions
is.na() and colSums(). It calculates the total number of missing values and
breaks that total down column by column.
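A minimal sketch of that check, assuming a data frame called loan_data (the name and the toy values are made up here; only is.na() and colSums() come from the text):

```r
# Toy data frame; column names follow the text, values are invented.
loan_data <- data.frame(
  person_age    = c(22, NA, 35, 41),
  person_income = c(52000, 61000, NA, NA),
  loan_status   = c("Approved", "Rejected", "Approved", NA)
)

# is.na() returns a logical matrix; summing it counts every missing cell.
total_missing <- sum(is.na(loan_data))

# colSums() on the same matrix gives the number of missing cells per column.
missing_by_column <- colSums(is.na(loan_data))

print(total_missing)
print(missing_by_column)
```

Columns with a count of zero are complete, so the per-column result shows directly where cleaning is needed.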
Once the missing values are identified, the script offers a couple of ways to
handle them, so that missing data does not distort the analysis.
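The text here does not name the specific solutions, so the sketch below shows two common options as an assumption: dropping incomplete rows with na.omit(), and filling numeric columns with the column median. The data frame and its values are made up.

```r
# Toy data with a few gaps (values are invented).
loan_data <- data.frame(
  person_age    = c(22, NA, 35, 41),
  person_income = c(52000, 61000, NA, 48000)
)

# Option A (assumed): drop every row that contains at least one NA.
complete_rows <- na.omit(loan_data)

# Option B (assumed): replace NAs in each numeric column with that column's median.
imputed <- loan_data
for (col in names(imputed)) {
  if (is.numeric(imputed[[col]])) {
    col_median <- median(imputed[[col]], na.rm = TRUE)
    imputed[[col]][is.na(imputed[[col]])] <- col_median
  }
}

print(complete_rows)
print(imputed)
```

Dropping rows is simplest but loses data; imputation keeps every row at the cost of inventing plausible values.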
2. Visualizing Missing Data
To make the problem easier to see, the script plots where data is missing, using
two visualization tools.
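The two tools are not named in this part of the text, so the sketch below is a stand-in using only base R graphics: a bar chart of missing counts per column and a simple map of missing cells. The data frame and its values are made up.

```r
# Toy data with a few gaps (values are invented).
loan_data <- data.frame(
  person_age    = c(22, NA, 35, 41, NA),
  person_income = c(52000, 61000, NA, 48000, 39000),
  loan_status   = c("Approved", "Rejected", "Approved", NA, "Approved")
)

na_matrix <- is.na(loan_data)            # TRUE where a cell is missing
missing_by_column <- colSums(na_matrix)  # missing count per column

# Plot 1: bar chart of missing values per column.
barplot(missing_by_column, main = "Missing values per column", ylab = "Count")

# Plot 2: map of missing cells, one column per variable, one row per record.
image(x = seq_len(ncol(na_matrix)), y = seq_len(nrow(na_matrix)),
      z = t(na_matrix) * 1,              # transpose so variables run along x
      col = c("grey90", "firebrick"),    # grey = present, red = missing
      axes = FALSE, xlab = "", ylab = "Row", main = "Missingness map")
axis(1, at = seq_len(ncol(na_matrix)), labels = colnames(na_matrix))
axis(2, at = seq_len(nrow(na_matrix)))
```

Dedicated packages can draw richer versions of both plots; the base R version is only meant to show the idea.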
3. Balancing the Dataset
The dataset may be imbalanced in how many rows belong to each class of a variable
such as loan_status. If one class (like "Approved") has far more rows than another
(like "Rejected"), it can skew the analysis.
The script corrects this with the ROSE library, trying three resampling methods.
Once the data is balanced, the analysis is less biased toward the majority class.
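The three methods are not listed in this part of the text. As an assumption, the sketch below uses the three options of ROSE's ovun.sample() function: "over", "under" and "both". The toy data and the seed are made up, and the calls sit at the top level so that ovun.sample() can find the variables it is given.

```r
set.seed(42)

# Toy imbalanced data: 90 "Approved" rows versus 10 "Rejected" rows (invented values).
loan_data <- data.frame(
  person_age    = round(runif(100, 20, 60)),
  person_income = round(runif(100, 20000, 90000)),
  loan_status   = factor(c(rep("Approved", 90), rep("Rejected", 10)))
)

n_major <- max(table(loan_data$loan_status))  # size of the larger class
n_minor <- min(table(loan_data$loan_status))  # size of the smaller class

if (requireNamespace("ROSE", quietly = TRUE)) {
  # Oversampling: resample minority rows with replacement up to N total rows.
  over <- ROSE::ovun.sample(loan_status ~ ., data = loan_data,
                            method = "over", N = 2 * n_major, seed = 1)$data

  # Undersampling: keep the minority rows and sample the majority down to N total rows.
  under <- ROSE::ovun.sample(loan_status ~ ., data = loan_data,
                             method = "under", N = 2 * n_minor, seed = 1)$data

  # Both: oversample the minority and undersample the majority,
  # aiming for a minority share of about p.
  both <- ROSE::ovun.sample(loan_status ~ ., data = loan_data,
                            method = "both", N = nrow(loan_data), p = 0.5, seed = 1)$data

  print(table(over$loan_status))
  print(table(under$loan_status))
  print(table(both$loan_status))
} else {
  message("Package ROSE is not installed; run install.packages(\"ROSE\") first.")
}
```

Oversampling keeps all the original rows but repeats minority cases, while undersampling throws majority rows away, so the right choice depends on how much data there is to spare.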
4. Removing Duplicates
Duplicate rows can distort results. The script identifies repeated rows using the
duplicated() function and then removes them. It can also find duplicates based on
specific columns, such as person_age, person_gender and person_education.
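A minimal sketch of both checks, assuming a data frame called loan_data with made-up values (duplicated() and the three column names come from the text):

```r
# Toy data: row 2 repeats row 1 exactly; row 4 matches row 1 on three columns only.
loan_data <- data.frame(
  person_age       = c(25, 25, 31, 25),
  person_gender    = c("Male", "Male", "Female", "Male"),
  person_education = c("Bachelor", "Bachelor", "Master", "Bachelor"),
  person_income    = c(40000, 40000, 58000, 45000)
)

# duplicated() flags each row that repeats an earlier row exactly.
is_dup <- duplicated(loan_data)
print(loan_data[is_dup, ])

# Keep only the first occurrence of every row.
deduped <- loan_data[!is_dup, ]

# Duplicates judged on a subset of columns only.
key_cols <- c("person_age", "person_gender", "person_education")
is_dup_by_key <- duplicated(loan_data[, key_cols])
print(loan_data[is_dup_by_key, ])
```

Note that duplicated() marks only the second and later copies, so the first occurrence of each row is always kept.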
5. Filtering Data
The script demonstrates several ways to filter rows based on conditions, for
example keeping only male participants or only people aged between 20 and 30.
This is useful when you want to focus on specific segments of the data.
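A minimal sketch of those two filters in base R. The data frame, its values, the "Male" label and the inclusive age bounds are assumptions made for illustration.

```r
# Toy data (invented values).
loan_data <- data.frame(
  person_age    = c(19, 24, 30, 45, 27),
  person_gender = c("Male", "Female", "Male", "Male", "Female")
)

# Filter 1: only male participants, using logical indexing.
males <- loan_data[loan_data$person_gender == "Male", ]

# Filter 2: people aged 20 to 30 (bounds treated as inclusive), using subset().
aged_20_to_30 <- subset(loan_data, person_age >= 20 & person_age <= 30)

# Conditions can be combined with & to narrow the segment further.
males_20_to_30 <- subset(loan_data,
                         person_gender == "Male" & person_age >= 20 & person_age <= 30)

print(males)
print(aged_20_to_30)
print(males_20_to_30)
```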
6. Encoding Categorical Variables
Sometimes categorical data like "Male" and "Female" needs to be turned into
numbers (0 and 1) before analysis. The script also converts education levels (like
"Bachelor", "Master") into numbers. This makes the data easier to work with for
statistical models.
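A minimal sketch of both conversions. Which gender label maps to 0, the "High School" level, the ordering of the education levels and the new column names are all assumptions made here for illustration.

```r
# Toy data (invented values).
loan_data <- data.frame(
  person_gender    = c("Male", "Female", "Female", "Male"),
  person_education = c("Bachelor", "Master", "High School", "Bachelor")
)

# Binary encoding: "Male" -> 0, anything else -> 1 (the mapping is an assumption).
loan_data$gender_num <- ifelse(loan_data$person_gender == "Male", 0, 1)

# Ordinal encoding: list the levels from lowest to highest (hypothetical order),
# then take the integer position of each value in that list.
education_levels <- c("High School", "Bachelor", "Master")
loan_data$education_num <- as.integer(
  factor(loan_data$person_education, levels = education_levels)
)

print(loan_data)
```

Setting the levels explicitly matters: without them, factor() orders labels alphabetically, which rarely matches the real ranking of education levels.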
7. Normalization
Normalization scales numeric values so they fit within a specific range, like
0 to 1. The script normalizes person_income using min-max scaling, which subtracts
the column minimum from each value and divides by the range (maximum minus minimum).
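A minimal sketch of min-max scaling on person_income. The helper name min_max, the new column name and the toy values are made up for illustration.

```r
# Toy data (invented values).
loan_data <- data.frame(person_income = c(20000, 35000, 50000, 80000))

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1].
# Assumes the column is not constant, otherwise the denominator is zero.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

loan_data$person_income_norm <- min_max(loan_data$person_income)
print(loan_data)
```

The smallest income maps to 0 and the largest to 1, with everything else spaced proportionally in between.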
8. Outlier Detection
Outliers, values that are far higher or lower than the rest, can distort the
analysis. The script identifies outliers using the Interquartile Range (IQR)
method.
First, it calculates the lower and upper bounds:
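A minimal sketch of the bounds calculation, assuming the check is run on person_income and uses the conventional 1.5 x IQR multiplier (neither detail is stated in the text; the values are made up):

```r
# Toy incomes with one extreme value (invented).
person_income <- c(30000, 32000, 35000, 38000, 40000, 42000, 45000, 250000)

# First and third quartiles, then the interquartile range.
q1  <- quantile(person_income, 0.25, names = FALSE)
q3  <- quantile(person_income, 0.75, names = FALSE)
iqr <- q3 - q1  # same value as IQR(person_income)

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Values outside the bounds are treated as outliers.
outliers <- person_income[person_income < lower_bound | person_income > upper_bound]

print(c(lower = lower_bound, upper = upper_bound))
print(outliers)
```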