Salary Prediction LinearRegression

hf66etmrt
June 7, 2023
1 Predicting Salary according to Years of Experience

1.0.1 Importing necessary libraries
[66]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
1.0.2 Loading the dataset
[25]: data = pd.read_csv('data.csv')  # Load the dataset and display the first 5 rows
data.head(5)
[25]:    YearsExperience   Salary
      0              1.1  39343.0
      1              1.3  46205.0
      2              1.5  37731.0
      3              2.0  43525.0
      4              2.2  39891.0
1.0.3 Exploring the Dataset
[31]: data.shape # Dataset contains 30 rows and 2 columns.
[31]: (30, 2)
[32]: data.info()  # Check the column names, non-null counts, and data types
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearsExperience 30 non-null float64
1 Salary 30 non-null float64
dtypes: float64(2)
memory usage: 608.0 bytes
[29]: data.describe()
[29]:        YearsExperience         Salary
      count        30.000000      30.000000
      mean          5.313333   76003.000000
      std           2.837888   27414.429785
      min           1.100000   37731.000000
      25%           3.200000   56720.750000
      50%           4.700000   65237.000000
      75%           7.700000  100544.750000
      max          10.500000  122391.000000
The describe() function is a convenient method in pandas that provides a statistical summary of a
DataFrame or Series. It calculates various descriptive statistics for each numerical column in the
dataset, including count, mean, standard deviation, minimum value, 25th percentile (Q1), median
(50th percentile or Q2), 75th percentile (Q3), and maximum value.
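As a rough illustration of what describe() reports, the sketch below recomputes the same statistics by hand for the five YearsExperience values shown by data.head(5). It uses Python's standard statistics module instead of pandas, which is a substitution made here for illustration only, and the variable names are assumptions. By default, describe() reports the sample standard deviation (dividing by n - 1) and linearly interpolated percentiles, which is what stdev() and quantiles(..., method="inclusive") compute.

```python
import statistics

# First five YearsExperience values, copied from the data.head(5) output.
years = [1.1, 1.3, 1.5, 2.0, 2.2]

summary = {
    "count": len(years),
    "mean": statistics.mean(years),
    # Sample standard deviation (divides by n - 1), like pandas' default.
    "std": statistics.stdev(years),
    "min": min(years),
    "max": max(years),
}

# Quartiles with linear interpolation between neighbouring sorted values.
q1, q2, q3 = statistics.quantiles(years, n=4, method="inclusive")
summary.update({"25%": q1, "50%": q2, "75%": q3})

for name, value in summary.items():
    print(f"{name:>5}: {value:.6f}")
```

Because this uses only five rows, the numbers differ from the describe() output above, which covers all 30 rows.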
[33]: data.isnull().sum()  # Check whether the dataset contains any null values
[33]: YearsExperience 0
Salary 0
dtype: int64
[40]: num_duplicates = data.duplicated().sum()  # Count duplicate rows in the dataset
if num_duplicates > 0:
    print(f"The dataset contains {num_duplicates} duplicate rows.")
    data = data.drop_duplicates()
    print("Dropped duplicates.")
    # Recount so the message reflects the data after dropping
    print("Number of duplicate rows after dropping:", data.duplicated().sum())
else:
    print("The dataset doesn't contain any duplicate values.")
The dataset doesn't contain any duplicate values.
1.0.4 Preparing the data
[50]: X = data.iloc[:,:-1] # Independent feature

X.head(5)
[50]: YearsExperience
0 1.1
1 1.3
2 1.5
3 2.0
4 2.2
[53]: Y = data.iloc[:, -1]  # Dependent variable (target)
Y.head(5)
[53]: 0 39343.0
1 46205.0
2 37731.0
3 43525.0
4 39891.0
Name: Salary, dtype: float64
1.0.5 Plotting the data to look at its distribution
[54]: plt.scatter(X, Y)
plt.title("Salary according to Experience")
plt.xlabel("Years of experience")  # X holds YearsExperience
plt.ylabel("Salary")               # Y holds Salary
[54]: Text(0, 0.5, 'Salary')
1.0.6 Splitting the dataset into train and test
[56]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=51)
[57]: print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(20, 1)
(10, 1)
(20,)
(10,)
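The shapes above follow from test_size=0.33 applied to 30 rows. The sketch below reproduces the arithmetic under the assumption that a fractional test_size is rounded up to a whole number of rows and the training set receives the remainder; the variable names are illustrative.

```python
import math

n_samples = 30      # rows in the dataset, from data.shape
test_size = 0.33    # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up to a whole number of rows,
# and the training set receives whatever remains.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 20 10
```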
1.0.7 Training the model
[58]: linear = LinearRegression()
[59]: linear.fit(X_train, Y_train)
[59]: LinearRegression()
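LinearRegression fits an ordinary least squares line. For a single feature the slope and intercept have a closed form, sketched below in plain Python on the five rows shown by data.head(5). This only illustrates the technique: scikit-learn uses its own solver internally, the function name fit_line is hypothetical, and the result differs from the notebook's model, which is trained on 20 rows.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# First five rows of the dataset, copied from the data.head(5) output.
years = [1.1, 1.3, 1.5, 2.0, 2.2]
salary = [39343.0, 46205.0, 37731.0, 43525.0, 39891.0]

slope, intercept = fit_line(years, salary)
# Residuals of a least squares fit with an intercept sum to (nearly) zero.
residuals = [y - (intercept + slope * x) for x, y in zip(years, salary)]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```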
1.0.8 Checking intercept and coefficient (slope)
[60]: linear.coef_
[60]: array([9523.14578831])
[61]: linear.intercept_
[61]: 24006.035761469633
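With one feature, the fitted model is the line Salary ≈ intercept + slope × YearsExperience. The sketch below plugs in the coefficient and intercept printed above; predict_salary is a hypothetical helper, not part of scikit-learn. The two inputs, 1.5 and 1.3 years, are rows 2 and 1 of the dataset, and both fall in the test set, so the results can be compared with the corresponding entries of Y_pred.

```python
SLOPE = 9523.14578831            # linear.coef_[0], as printed in the notebook
INTERCEPT = 24006.035761469633   # linear.intercept_, as printed in the notebook


def predict_salary(years_experience):
    """Evaluate the fitted line for a given number of years of experience."""
    return INTERCEPT + SLOPE * years_experience


for years in (1.5, 1.3):
    print(f"{years} years -> {predict_salary(years):.2f}")
```

These come out at about 38290.75 and 36386.13, which agrees with the values linear.predict returns for those rows.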
1.0.9 Testing the model
[62]: Y_pred = linear.predict(X_test)
[64]: Y_pred
[64]: array([106857.40411979,  54480.10228407,  38290.75444394, 102095.83122563,
              54480.10228407, 115428.23532927,  70669.4501242 ,  80192.59591251,
              36386.12528628,  81144.91049134])
[65]: Y_test
[65]: 24 109431.0
8 64445.0
2 37731.0
23 113812.0
7 54445.0
27 112635.0
15 67938.0
18 81363.0
1 46205.0
19 93940.0
Name: Salary, dtype: float64
[70]: score = r2_score(Y_test, Y_pred)

print(f"Score: {score *100}")
Score: 92.78148083974355
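The score printed above is the coefficient of determination multiplied by 100, where R² = 1 − SS_res / SS_tot. It measures the share of variance in the test salaries that the line explains, not a percentage of correct predictions. As a check, the sketch below recomputes it in plain Python from the Y_test and Y_pred values printed earlier; the variable names are illustrative.

```python
# Actual salaries (Y_test) and predictions (Y_pred), copied from the notebook
# output in the same row order.
y_true = [109431.0, 64445.0, 37731.0, 113812.0, 54445.0,
          112635.0, 67938.0, 81363.0, 46205.0, 93940.0]
y_pred = [106857.40411979, 54480.10228407, 38290.75444394, 102095.83122563,
          54480.10228407, 115428.23532927, 70669.4501242, 80192.59591251,
          36386.12528628, 81144.91049134]

mean_true = sum(y_true) / len(y_true)
# Residual sum of squares: error left over after applying the model.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: error of always predicting the mean salary.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(f"R^2: {r2:.4f}")
```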
1.0.10 Plotting the graph

[79]: # Plot the actual test points
plt.scatter(X_test, Y_test, color='blue', label='Actual')
# Plot the fitted regression line through the predictions
plt.plot(X_test, Y_pred, color='red', linewidth=2, label='Predicted')
plt.title("Salary Prediction")
plt.xlabel("Years of experience")  # X_test holds YearsExperience
plt.ylabel("Salary")               # Y_test and Y_pred hold Salary
plt.legend()
plt.show()
