Salary Prediction LinearRegression


hf66etmrt

June 7, 2023

1 Predicting Salary according to Years of experience


1.0.1 Importing necessary libraries

[66]: import numpy as np
      import pandas as pd
      import matplotlib.pyplot as plt
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LinearRegression
      from sklearn.metrics import r2_score

1.0.2 Loading the dataset

[25]: data = pd.read_csv('data.csv')  # Load the dataset and display the first 5 rows.
      data.head(5)

[25]:    YearsExperience   Salary
      0              1.1  39343.0
      1              1.3  46205.0
      2              1.5  37731.0
      3              2.0  43525.0
      4              2.2  39891.0

1.0.3 Exploring the Dataset

[31]: data.shape # Dataset contains 30 rows and 2 columns.

[31]: (30, 2)

[32]: data.info()  # Checking information about the dataset: columns, non-null counts, and data types.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
0 YearsExperience 30 non-null float64
1 Salary 30 non-null float64
dtypes: float64(2)
memory usage: 608.0 bytes

[29]: data.describe()

[29]:        YearsExperience         Salary
      count        30.000000      30.000000
      mean          5.313333   76003.000000
      std           2.837888   27414.429785
      min           1.100000   37731.000000
      25%           3.200000   56720.750000
      50%           4.700000   65237.000000
      75%           7.700000  100544.750000
      max          10.500000  122391.000000

The describe() function is a convenient method in pandas that provides a statistical summary of a
DataFrame or Series. It calculates various descriptive statistics for each numerical column in the
dataset, including count, mean, standard deviation, minimum value, 25th percentile (Q1), median
(50th percentile or Q2), 75th percentile (Q3), and maximum value.
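As a rough illustration of what those summary statistics mean, the sketch below recomputes them with Python's standard library for the five Salary values shown by data.head(5). It is not the notebook's own code. The helper name summarize is hypothetical, and the sketch assumes pandas' defaults: the sample standard deviation (ddof=1) and linear interpolation for percentiles.

```python
import statistics

# The five Salary values shown by data.head(5) above.
salaries = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0]

def summarize(values):
    """Hypothetical helper mirroring the rows of describe() for one column."""
    # method="inclusive" interpolates linearly between order statistics,
    # which is assumed here to match the pandas default for percentiles.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

summary = summarize(salaries)
for name, value in summary.items():
    print(f"{name:>5}  {value:.6f}")
```

Calling pd.Series(salaries).describe() on the same five values should report the same figures, which is a quick way to check the assumptions above.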

[33]: data.isnull().sum()  # Checking if the dataset contains any null values.

[33]: YearsExperience 0
Salary 0
dtype: int64

[40]: num_duplicates = data.duplicated().sum()  # Count duplicate rows in the dataset.

      if num_duplicates > 0:
          print(f"The dataset contains {num_duplicates} duplicate rows.")
          data = data.drop_duplicates()
          print("Dropped duplicates.")
          # Recount so the message reflects the cleaned data, not the stale count.
          print("Number of duplicate rows after dropping:", data.duplicated().sum())
      else:
          print("The dataset doesn't contain any duplicate values.")

The dataset doesn't contain any duplicate values.

1.0.4 Preparing the data

[50]: X = data.iloc[:, :-1]  # Independent feature (kept 2-D, as scikit-learn expects)
      X.head(5)

[50]: YearsExperience
0 1.1
1 1.3
2 1.5
3 2.0
4 2.2

[53]: Y = data.iloc[:, -1]  # Dependent feature (target)
      Y.head(5)

[53]: 0 39343.0
1 46205.0
2 37731.0
3 43525.0
4 39891.0
Name: Salary, dtype: float64

1.0.5 Plotting the data to look at its distribution

[54]: plt.scatter(X, Y)
      plt.title("Salary according to Experience")
      plt.xlabel("Years of experience")  # X holds YearsExperience
      plt.ylabel("Salary")               # Y holds Salary

[54]: Text(0, 0.5, 'Salary')

1.0.6 Splitting the dataset into train and test

[56]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33,
                                                          random_state=51)

[57]: print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(20, 1)
(10, 1)
(20,)
(10,)
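The 20/10 split follows from test_size=0.33 applied to 30 rows. The sketch below reproduces that arithmetic. It assumes that, for a fractional test_size with no train_size given, scikit-learn rounds the test count up and assigns the remaining rows to the training set. The variable names are illustrative only.

```python
import math

n_samples = 30     # rows in the dataset, from data.shape
test_size = 0.33   # fraction passed to train_test_split

# Assumption: the test count is rounded up, and the training set
# receives whatever rows are left over.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption, 0.33 * 30 = 9.9 rounds up to 10 test rows, leaving 20 for training, which agrees with the shapes printed above.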

1.0.7 Training the model

[58]: linear = LinearRegression()

[59]: linear.fit(X_train, Y_train)

[59]: LinearRegression()

1.0.8 Checking intercept and coefficient (slope)

[60]: linear.coef_

[60]: array([9523.14578831])

[61]: linear.intercept_

[61]: 24006.035761469633
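With one feature, the fitted model is the straight line Salary = intercept + slope * YearsExperience. As a minimal sketch, the function below (predict_salary is a hypothetical name, not part of the notebook) plugs in the two values printed above and evaluates rows 1 and 2 of the dataset, whose experience values appear in the data.head(5) output.

```python
# Values copied from the linear.coef_ and linear.intercept_ outputs above.
slope = 9523.14578831
intercept = 24006.035761469633

def predict_salary(years_experience):
    """Evaluate the fitted line by hand (hypothetical helper)."""
    return intercept + slope * years_experience

pred_row1 = predict_salary(1.3)  # row 1: YearsExperience = 1.3
pred_row2 = predict_salary(1.5)  # row 2: YearsExperience = 1.5

print(round(pred_row1, 2))  # 36386.13
print(round(pred_row2, 2))  # 38290.75
```

Rows 1 and 2 both land in the test set of this split, so these two figures can be compared directly with the corresponding entries of Y_pred in the next section.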

1.0.9 Testing the model

[62]: Y_pred = linear.predict(X_test)

[64]: Y_pred

[64]: array([106857.40411979,  54480.10228407,  38290.75444394, 102095.83122563,
              54480.10228407, 115428.23532927,  70669.4501242 ,  80192.59591251,
              36386.12528628,  81144.91049134])

[65]: Y_test

[65]: 24 109431.0
8 64445.0
2 37731.0
23 113812.0
7 54445.0
27 112635.0
15 67938.0
18 81363.0
1 46205.0
19 93940.0
Name: Salary, dtype: float64

[70]: score = r2_score(Y_test, Y_pred)  # Coefficient of determination (R squared)
      print(f"Score: {score * 100}")

Score: 92.78148083974355
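The score is the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below recomputes it in plain Python from the Y_test and Y_pred values printed above. The function name r_squared is hypothetical, and the result can differ from the notebook's in the last few digits because the printed predictions are rounded.

```python
# Actual and predicted salaries, copied from the Y_test and Y_pred outputs.
y_true = [109431.0, 64445.0, 37731.0, 113812.0, 54445.0,
          112635.0, 67938.0, 81363.0, 46205.0, 93940.0]
y_pred = [106857.40411979, 54480.10228407, 38290.75444394, 102095.83122563,
          54480.10228407, 115428.23532927, 70669.4501242, 80192.59591251,
          36386.12528628, 81144.91049134]

def r_squared(actual, predicted):
    """Hypothetical helper: R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

r2 = r_squared(y_true, y_pred)
print(f"Score: {r2 * 100:.2f}")
```

A value near 92.78 means the line accounts for roughly 93% of the variance in the ten held-out salaries. With a test set this small, the figure is sensitive to which rows the split happens to select.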

1.0.10 Plotting the graph


[79]: # Scatter plot of the actual test data points
      plt.scatter(X_test, Y_test, color='blue', label='Actual')

      # Line through the model's predictions for the test inputs
      plt.plot(X_test, Y_pred, color='red', linewidth=2, label='Predicted')

      plt.title("Salary Prediction")
      plt.xlabel("Years of experience")  # X axis carries the feature
      plt.ylabel("Salary")               # Y axis carries the target
      plt.legend()

      plt.show()
