
PRICE PREDICTION ON LAPTOPS USING MACHINE

LEARNING
Project Report
Submitted By

KEERTHIVASAN.K
20CS0910

In partial fulfilment for the award of the degree


Of
BACHELOR OF COMPUTER SCIENCE

DEPARTMENT OF COMPUTER SCIENCE


ACHARIYA ARTS AND SCIENCE COLLEGE
(Recognized under sec-2f of the UGC Act 1956)
(Affiliated to Pondicherry University)
PUDUCHERRY – 605 110
NOVEMBER – 2022

ACHARIYA ARTS AND SCIENCE COLLEGE
(Recognized under sec-2f of the UGC Act 1956)
(Affiliated to Pondicherry University)
VILLIANUR, PUDUCHERRY – 605 110

DEPARTMENT OF COMPUTER SCIENCE


BONAFIDE CERTIFICATE
This is to certify that the Project Report entitled “PRICE
PREDICTION ON LAPTOPS USING MACHINE LEARNING” is the
bonafide record of work done by Keerthivasan K. (20CS0910), B.Sc. (Computer
Science), in partial fulfilment of the requirements of the Bachelor's
Degree in Computer Science during the years 2020-2023.

INTERNAL GUIDE HEAD OF THE DEPARTMENT

Submitted to the University Examinations held on …………………… at


Achariya Arts And Science College, Villianur, Puducherry – 605 110.

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT
It is my esteemed privilege to express my gratitude to the Almighty and my respect to
all those who have guided and inspired me during the course of this project. First and
foremost, I express my sincere gratitude to our Chief Mentor Mr. J. ARAWINDHAN.

I am highly indebted to Dr. V. SHANMUGARAJA, Principal, Achariya Arts
and Science College, Pondicherry, for his kind consent to carry out this project and
for the opportunity to pursue my graduation in this reputed institution.

I take immense pleasure in conveying my sincere and heartfelt gratitude to
Mr. R. MURUGADOSS, M.C.A., Head of the Department of Computer Science,
Achariya Arts and Science College, Puducherry, for giving permission to do this project
and for providing the resources to enrich and complete the task with valuable information. I
am greatly indebted to our guide Mrs. M. YASOTHAPRIYA, Assistant Professor,
Department of Computer Science, for her encouragement and valuable suggestions
throughout the study phase, implementation phase, and preparation of the project.

I would be failing in my duty if I did not express my gratitude to my parents,
whose smiling faces and benevolence made this endeavour a pleasure to succeed in.
Finally, with love and affection, I thank all who have rendered their support for the
successful completion of this project report.

DECLARATION
I hereby declare that the Project Report entitled "PRICE PREDICTION ON
LAPTOPS USING MACHINE LEARNING", submitted to Achariya Arts and
Science College, affiliated to Pondicherry University, in partial fulfilment of the
requirements for the award of the degree of "BACHELOR OF COMPUTER
SCIENCE", is a record of project work done by me during the period September-
November 2022 under the supervision and guidance of Mrs. M. YASOTHAPRIYA,
Asst. Professor, Department of Computer Science.

PLACE: PUDUCHERRY, KEERTHIVASAN.K(20CS0910)


DATE:

ABSTRACT
This paper presents a laptop price prediction system built using a supervised machine
learning technique. The work uses the random decision forest as the prediction
method, which achieved 88.7% prediction precision. With a random decision
forest, there are multiple independent variables but exactly one dependent
variable, whose actual and predicted values are compared to measure the precision of the results.
This paper proposes a system where the price is the dependent variable to be predicted,
and this price is derived from factors such as the laptop's model, RAM, ROM (HDD/SSD),
GPU, CPU, IPS display, and touch screen.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 BACKGROUND INTRODUCTION
   1.2 ABOUT DATASET
   1.3 RELATED TOPICS TO THIS PROJECT
       1.3.1 MACHINE LEARNING
       1.3.2 FEATURE ENGINEERING
       1.3.3 DATA PRE-PROCESSING
       1.3.4 LINEAR REGRESSION
       1.3.5 RIDGE REGRESSION
       1.3.6 LASSO REGRESSION
       1.3.7 DECISION TREES
       1.3.8 RANDOM FOREST
       1.3.9 HYPERPARAMETER TUNING
   1.4 PROBLEM STATEMENT
   1.5 RELATED WORKS
   1.6 GOALS AND OBJECTIVES
   1.7 REPORT ORGANIZATION
       1.7.1 CHAPTER 1 – INTRODUCTION
       1.7.2 CHAPTER 2 – LITERATURE REVIEW
       1.7.3 CHAPTER 3 – REQUIREMENT ANALYSIS
       1.7.4 CHAPTER 4 – SYSTEM ARCHITECTURE
       1.7.5 CHAPTER 5 – METHODOLOGY
       1.7.6 CHAPTER 6 – IMPLEMENTATION
       1.7.7 CHAPTER 7 – RESULT
       1.7.8 CHAPTER 8 – CONCLUSION
2. LITERATURE REVIEW
   2.1 LOADING THE DATA
   2.2 PREPROCESSING THE DATA AND DATA VISUALIZATION
   2.3 MODEL BUILDING
   2.4 UI INTEGRATION AND DEPLOYMENT
   2.5 ADVANTAGES OF LPP APP
   2.6 LIMITATIONS OF LPP APP
3. REQUIREMENT ANALYSIS
   3.1 PROJECT REQUIREMENTS
       3.1.1 HARDWARE REQUIREMENTS
       3.1.2 SOFTWARE REQUIREMENTS
4. SYSTEM ARCHITECTURE
   4.1 SYSTEM METHODOLOGY
   4.2 USE CASE DIAGRAM
   4.3 SEQUENCE MODEL
   4.4 ACTIVITY DIAGRAM
   4.5 COLLABORATION DIAGRAM
   4.6 SYSTEM OVERVIEW
5. METHODOLOGY
   5.1 LOADING THE DATA
   5.2 CLEANING THE DATA
   5.3 EXPLORATORY DATA ANALYSIS
   5.4 MODEL BUILDING
   5.5 WEB APP DEVELOPMENT
   5.6 DEPLOYMENT
6. IMPLEMENTATION
7. RESULT
8. CONCLUSION
REFERENCES

CHAPTER 1. INTRODUCTION
1.1 BACKGROUND INTRODUCTION
Laptop price prediction, especially when the laptop comes straight from the factory
to electronic markets and stores, is both a critical and important task. The mad rush that
we saw in 2020 for laptops to support remote work and learning is no longer there. In
India, demand for laptops soared after the nationwide lockdown, leading to 4.1
million unit shipments in the June quarter of 2021, the highest in five years. Even
now in 2022, most companies are adapting to a remote work environment, which keeps
demand for laptops high. Accurate laptop price prediction requires expert knowledge,
because the price usually depends on many distinctive features and factors. Typically,
the most significant ones are the brand and model, RAM, ROM, GPU, CPU, etc. In this
project, we applied different methods and techniques to achieve higher precision in
laptop price prediction.

Figure 1

1.2 ABOUT DATASET
We obtained the dataset from Kaggle. Most of the columns in this dataset are noisy and
contain a lot of information, but with feature engineering we can get better
results. The main limitation is that we have relatively little data, although we can still
obtain good accuracy with it; a larger dataset would, of course, be better. Using this
data, we have developed a website that can predict a tentative price of a laptop based
on the user's configuration.

Figure 2

1.3 RELATED TOPICS TO THIS PROJECT


1.3.1 MACHINE LEARNING
In the real world, we are surrounded by humans who can learn everything from their
experiences with their learning capability, and we have computers or machines which
work on our instructions. But can a machine also learn from experiences or past data
like a human does? So here comes the role of Machine Learning.

Figure 3

Machine Learning is a subset of artificial intelligence that is mainly
concerned with the development of algorithms which allow a computer to learn from
data and past experiences on its own. The term machine learning was first
introduced by Arthur Samuel in 1959. We can define it in a summarized way as:

“Machine learning enables a machine to automatically learn from data, improve


performance from experiences, and predict things without being explicitly
programmed. “

With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that helps in making predictions or
decisions without being explicitly programmed. Machine learning brings computer
science and statistics together to create predictive models. Machine learning
constructs or uses algorithms that learn from historical data. The more information
we provide, the better the performance will be.

A machine has the ability to learn if it can improve its performance by gaining
more data.

A Machine Learning system learns from historical data, builds prediction
models, and, whenever it receives new data, predicts the output for it. The
accuracy of the predicted output depends upon the amount of data, as a large amount of
data helps to build a better model which predicts the output more accurately.

Suppose we have a complex problem where we need to perform some predictions.
Instead of writing code for it, we just need to feed the data to generic algorithms,
and with the help of these algorithms the machine builds the logic from the data and
predicts the output. Machine learning has changed our way of thinking about such
problems. The block diagram below explains the working of a machine learning
algorithm:

Features of Machine Learning:

 Machine learning uses data to detect various patterns in a given dataset.
 It can learn from past data and improve automatically.
 It is a data-driven technology.
 Machine learning is much like data mining, as it also deals with huge
amounts of data.

Need for Machine Learning

The need for machine learning is increasing day by day. The reason behind this need
is that machine learning can do tasks that are too complex for a person to
implement directly. As humans, we have limitations: we cannot access and process huge
amounts of data manually, so we need computer systems, and this is where machine
learning comes in to make things easy for us.

We can train machine learning algorithms by providing them with huge amounts of data
and letting them explore the data, construct models, and predict the required output
automatically. The performance of a machine learning algorithm depends on the
amount of data, and it can be measured by the cost function. With the help of
machine learning, we can save both time and money.

The importance of machine learning can be easily understood through its use cases.
Currently, machine learning is used in self-driving cars, cyber fraud detection, face
recognition, friend suggestions on Facebook, and more. Top companies such as
Netflix and Amazon have built machine learning models that use vast amounts
of data to analyse user interests and recommend products accordingly.

Importance of Machine Learning:

 Rapid increment in the production of data


 Solving complex problems, which are difficult for a human
 Decision making in various sectors, including finance
 Finding hidden patterns and extracting useful information from data.

Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1.3.2 FEATURE ENGINEERING
Feature engineering refers to the manipulation — addition, deletion, combination,
mutation — of the data set to improve machine learning model training,
leading to better performance and greater accuracy. Effective feature engineering
is based on sound knowledge of the business problem and the available data sources.

Creating new features gives you a deeper understanding of your data and results in
more valuable insights. When done correctly, feature engineering is one of the most
valuable techniques of data science, but it is also one of the most challenging. A
common example of feature engineering is when your doctor uses your body mass
index (BMI). BMI is calculated from both body weight and height and serves as a
surrogate for a characteristic that is very hard to accurately measure: the proportion of
lean body mass.

Some common types of feature engineering include:

 Scaling and normalization mean adjusting the range and centre of data to
ease learning and improve the interpretation of the results.
 Filling missing values implies filling in null values based on expert
knowledge, heuristics, or by some machine learning techniques. Real-world
datasets can be missing values due to the difficulty of collecting complete
datasets and because of errors in the data collection process.
 Feature selection means removing features because they are unimportant,
redundant, or outright counterproductive to learning. Sometimes you simply
have too many features and need fewer.
 Feature coding involves choosing a set of symbolic values to represent
different categories. Concepts can be captured with a single column that
comprises multiple values, or they can be captured with multiple columns,
each of which represents a single value and has a true or false in each field.
For example, feature coding can indicate whether a particular row of data was
collected on a holiday. This is a form of feature construction.
 Feature construction creates a new feature(s) from one or more other
features. For example, using the date you can add a feature that indicates the
day of the week. With this added insight, the algorithm could discover that
certain outcomes are more likely on a Monday or a weekend.
 Feature extraction means moving from low-level features that are unsuitable
for learning — practically speaking, you get poor testing results — to higher-
level features that are useful for learning. Often feature extraction is valuable
when you have specific data formats — like images or text — that must be
converted to a tabular row-column, example-feature format.
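As a small illustration of these ideas, the hedged sketch below combines a few of them with pandas and scikit-learn; the column names (Weight, Date, Brand) and the values are made up purely for illustration and are not from the project's dataset.

```python
# A minimal sketch of a few feature engineering steps with pandas/scikit-learn.
# Column names ("Weight", "Date", "Brand") are hypothetical, for illustration only.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Weight": [1.2, 2.1, None, 1.5],          # kg, with one missing value
    "Date": pd.to_datetime(["2022-01-03", "2022-01-08", "2022-01-09", "2022-01-10"]),
    "Brand": ["Dell", "HP", "Dell", "Asus"],
})

# Filling missing values: impute the missing weight with the column mean
df["Weight"] = df["Weight"].fillna(df["Weight"].mean())

# Feature construction: derive the day of the week from the date
df["DayOfWeek"] = df["Date"].dt.day_name()

# Feature coding: one-hot encode the categorical Brand column
df = pd.get_dummies(df, columns=["Brand"])

# Scaling: bring Weight to zero mean and unit variance
df[["Weight"]] = StandardScaler().fit_transform(df[["Weight"]])
print(df)
```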

1.3.3 DATA PRE-PROCESSING
Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and most crucial step when creating a machine
learning model. When creating a machine learning project, it is not always the case that
we come across clean and formatted data, and before doing any operation with
data it is mandatory to clean it and put it into a formatted shape. For this we use the data
pre-processing step.

Real-world data generally contains noise and missing values, and may be in an
unusable format that cannot be used directly for machine learning models. Data
pre-processing is the required task of cleaning the data and making it suitable for a
machine learning model, which also increases the accuracy and efficiency of the
model.

1.3.4 LINEAR REGRESSION


Linear regression quantifies the relationship between one or more predictor
variable(s) and one outcome variable. Linear regression is commonly used for
predictive analysis and modelling. For example, it can be used to quantify the relative
impacts of age, gender, and diet (the predictor variables) on height (the outcome
variable). Linear regression is also known as multiple regression, multivariate
regression, ordinary least squares (OLS), and simply regression. Its common forms are
simple linear regression (one predictor) and multiple linear regression (several
predictors); a small sketch of the simple case is given below.
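The following is a minimal, hedged sketch of simple linear regression with scikit-learn, not the project's actual code; the RAM-versus-price values are invented only to show the API.

```python
# A minimal sketch of simple linear regression with scikit-learn.
# The data here is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# One predictor (RAM in GB) and one outcome (price); values are hypothetical
X = np.array([[4], [8], [8], [16], [32]])
y = np.array([30000, 45000, 48000, 70000, 110000])

model = LinearRegression()
model.fit(X, y)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted price for 12 GB RAM:", model.predict([[12]])[0])
```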

1.3.5 RIDGE REGRESSION
Ridge regression is a model tuning method that is used to analyse data that suffers
from multicollinearity. This method performs L2 regularization. When
multicollinearity occurs, least-squares estimates are unbiased but their variances are large,
which results in predicted values being far away from the actual values.

1.3.6 LASSO REGRESSION
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is
where data values are shrunk towards a central point, like the mean. The lasso
procedure encourages simple, sparse models (i.e., models with fewer parameters).
This regression is well-suited for models showing high levels of multicollinearity or
when you want to automate certain parts of model selection, like variable
selection/parameter elimination.
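The hedged sketch below contrasts Ridge (L2) and Lasso (L1) on synthetic data; the alpha values are arbitrary examples, not tuned settings from this project. Note how Lasso can drive irrelevant coefficients exactly to zero, performing variable selection.

```python
# A minimal sketch contrasting Ridge (L2) and Lasso (L1) regression in scikit-learn.
# alpha controls the strength of the penalty; the data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
# Only the first two features really matter; the rest are noise
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients towards zero
lasso = Lasso(alpha=0.05).fit(X, y)  # can zero out irrelevant coefficients entirely

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```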

1.3.7 DECISION TREES


Decision Trees (DTs) are a non-parametric supervised learning method used for
classification and regression. The goal is to create a model that predicts the value of a
target variable by learning simple decision rules inferred from the data features. A tree
can be seen as a piecewise constant approximation.

For instance, in the example below, decision trees learn from data to approximate a
sine curve with a set of if-then-else decision rules. The deeper the tree, the more
complex the decision rules and the fitter the model.
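The sketch below reproduces that idea in code: a decision tree regressor approximating a sine curve, where a deeper tree yields a finer piecewise-constant fit. The data is randomly generated for illustration only.

```python
# A minimal sketch of a decision tree approximating a sine curve with
# if-then-else rules; deeper trees give finer steps.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(X).ravel()

shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)   # coarse, piecewise-constant fit
deep = DecisionTreeRegressor(max_depth=5).fit(X, y)      # finer approximation

X_test = np.arange(0.0, 5.0, 0.5).reshape(-1, 1)
print("depth 2:", shallow.predict(X_test))
print("depth 5:", deep.predict(X_test))
```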

1.3.8 RANDOM FOREST
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of
combining multiple classifiers to solve a complex problem and to improve the
performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to
improve the predictive accuracy of that dataset." Instead of relying on one decision
tree, the random forest takes the prediction from each tree and, based on the majority
vote of predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:
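In addition, the following minimal sketch shows the same idea in code: an ensemble of decision trees whose predictions are averaged. The dataset is synthetic and only stands in for the laptop data.

```python
# A minimal sketch of a Random Forest regressor: an ensemble of decision
# trees whose predictions are averaged. Data is synthetic, for illustration.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("R2 score on test data:", r2_score(y_test, rf.predict(X_test)))
```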

1.3.9 HYPERPARAMETER TUNING


Hyperparameter tuning is an essential part of controlling the behaviour of a machine
learning model. If we don’t correctly tune our hyperparameters, our estimated model
parameters produce suboptimal results, as they don’t minimize the loss function. This
means our model makes more errors. In practice, key indicators like the accuracy or
the confusion matrix will be worse.
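A common way to tune hyperparameters is a grid search with cross-validation. The sketch below is only an illustration of the mechanism; the parameter grid shown is an assumption, not the grid actually used in this project.

```python
# A minimal sketch of hyperparameter tuning with a grid search over a
# Random Forest; the parameter grid is an example, not the project's final one.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=6, noise=5, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "max_features": [0.5, 0.75, 1.0],
}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated R2:", search.best_score_)
```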

1.4 PROBLEM STATEMENT


The problem statement is that if any user wants to buy a laptop, our application
should be able to provide a tentative price of the laptop according to the user's
configuration. Although it looks like a simple project of just developing a model, the
dataset we have is noisy and needs a lot of feature engineering and pre-processing,
which is what makes this project interesting to develop.

1.5 RELATED WORKS
Predicting the price of laptops has been studied extensively in various research works. Listiani
discussed, in her Master's thesis, that a regression model built
using a Decision Tree and Random Forest regressor can predict the price of a laptop that
has been leased with better precision than multivariate regression or simple
multiple regression. This is on the grounds that the Decision Tree algorithm is better at
dealing with datasets with more dimensions and is less prone to overfitting and
underfitting. The weakness of this research is that the improvement of simple regression by
the more advanced Decision Tree regression was not shown in basic indicators
like the mean, variance, or standard deviation.

1.6 GOALS AND OBJECTIVES


Our goal in this project is to help consumers get a predicted price for a laptop
matching the configuration they like. To achieve this, we have several objectives. They
are:
 Knowing the dataset and getting an overview of it.
 Data Cleaning.
 Exploratory Data Analysis (EDA) on significant features of dataset.
 Feature Engineering to make our data meaningful.
 Building ML models.
 Developing a web application.
 Deploying our web application.

1.7 REPORT ORGANIZATION


1.7.1 CHAPTER 1 – INTRODUCTION
In this chapter, we have given a conceptual explanation of the project. We then described
the problem statement and explained all the related topics that are used in
this project. Next, we provided the goals and objectives of this project.

1.7.2 CHAPTER 2 – LITERATURE REVIEW
This chapter focuses on the methods used to clean and transform
our data (data pre-processing and feature engineering), the ML techniques used
in model building (Linear Regression, Ridge Regression, Lasso Regression, Decision
Trees, Random Forest), and the platform (Heroku) used to deploy our model in the form
of a web application (built with Streamlit). We checked the MAE score and R2 score
to see which algorithm is best for predicting the prices.

1.7.3 CHAPTER 3 - REQUIREMENT ANALYSIS


In this chapter, we will show what requirements are needed on both the hardware and
software side, and which libraries are needed to achieve our project objectives. Our
library requirements mainly concern data visualization, data cleaning, and model
building. To make the Jupyter Notebook more attractive, we used some
libraries to change its theme. Next, we carried out a complete feasibility study from
technological and business perspectives.

1.7.4 CHAPTER 4 – SYSTEM ARCHITECTURE


In this chapter, we will explain the complete workflow of the project with use of
Block Diagrams.

1.7.5 CHAPTER 5 – METHODOLOGY


In this chapter, we explain how we cleaned our data, how we performed Exploratory
Data Analysis to analyse it, and how we used feature engineering to make our data
more meaningful. We also used various ML algorithm techniques to compare
which one gives the best accuracy, and describe how we did hyperparameter tuning
to increase the accuracy of the prediction.

1.7.6 CHAPTER 6 – IMPLEMENTATION


In this chapter, we have given the source code of our notebook and the code for the web
application, and we have explained what we have done in each phase.

1.7.7 CHAPTER 7 - RESULT
This chapter focuses on the output of our web application. We have provided
screenshots of the sample output, along with graphs of the prediction output from the
two versions of the random forest regressor: one version with default hyperparameters
and a second with improved hyperparameters. Using the two graphs, we compared the
smoothness of the fit between actual and predicted values.

1.7.8 CHAPTER 8 - CONCLUSION


This chapter mainly focuses on justifying how our web app is useful, what kind of
limitations it has, and how we can improve it in the future.

CHAPTER 2 - LITERATURE REVIEW
2.1 LOADING THE DATA
We obtained our dataset from Kaggle. Kaggle allows users to find and publish data sets,
explore and build models in a web-based data-science environment, work with other
data scientists and machine learning engineers, and enter competitions to solve data
science challenges. As ML and Data Science aspirants, we naturally looked to
Kaggle, and that is where we found this dataset.

2.2 PREPROCESSING THE DATA AND DATA VISUALIZATION


We did some data cleaning so the data could be used for model building, since it had
several messy features that are needed for our project. So, we pre-processed the data in a
way that lets us use it efficiently. Then comes data visualization, which is mainly done
in the EDA phase for the distinct features of the dataset. We used feature engineering
to pre-process our data.

2.3 MODEL BUILDING
To train our model, we took advantage of the scikit-learn library. We used
various ML algorithms:
 Linear models
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
 Tree-based models
1. Decision Trees
2. Random Forest
We used this many models to compare their accuracy, and selected the Random Forest
model, which gave around 88.5% accuracy. We did not stop there: we performed
hyperparameter tuning to set the right parameters for the algorithm, which raised the
accuracy from 88.5% to 88.7%. It may not look like much of an increase, but
when predicting over the whole dataset we can see from the skewness of the graph
how many samples the model actually captures, so the improvement is larger than it appears.

2.4 UI INTEGRATION AND DEPLOYMENT


We wrote all our code in the notebook and pickled two files: one holds the
pipeline of our Random Forest model, and the other holds the trained data set.
For creating the UI for the web app, we are using Streamlit. It is an open-source app
framework in Python. It helps us create web apps for data science and
machine learning in a short time, and it is compatible with major Python libraries such
as scikit-learn, Keras, PyTorch, SymPy (LaTeX), NumPy, pandas, Matplotlib, etc.
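A minimal sketch of how these two artefacts might be pickled and later reloaded is shown below; the variable and file names (pipe, df, pipe.pkl, df.pkl) are assumptions for illustration and may differ from the notebook.

```python
import pickle

# Minimal sketch, assuming `pipe` is the fitted pipeline and `df` the prepared
# DataFrame produced earlier in the notebook (names are assumptions).
with open("pipe.pkl", "wb") as f:
    pickle.dump(pipe, f)      # save the fitted model pipeline
with open("df.pkl", "wb") as f:
    pickle.dump(df, f)        # save the prepared dataset

# Later, e.g. inside the Streamlit app, load them back:
with open("pipe.pkl", "rb") as f:
    pipe = pickle.load(f)
with open("df.pkl", "rb") as f:
    df = pickle.load(f)
```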

After writing our code for the web application, we pushed the necessary files to the
Git repository. Then we used Heroku to deploy our web application. Heroku is a
container-based cloud Platform as a Service (PaaS). Developers use Heroku to
deploy, manage, and scale modern apps. This platform is elegant, flexible, and easy
to use, offering developers the simplest path to getting their apps to market.

2.5 ADVANTAGES OF LPP APP


 It is an easy-to-use application: customers select the configuration
they want, tap submit, and get the price range in a second.
 Our application's accuracy is high compared to other existing works, and
we have designed the UI in a way that is easily readable.
 With our app, customers no longer need to worry about setting a
budget for their dream laptops, especially students.

2.6 LIMITATIONS OF LPP APP
 Our app does not cover every single feature of a laptop, so if someone wants
the price of a laptop configured down to finer internal details, this is not the
app they are looking for.
 Our app is not 100% accurate, because if the vendor gives discounts or offers
under certain conditions, the actual price range may differ.
 It is not a portable Android application but a web application, so it is less
convenient for customers who prefer not to use a browser.

CHAPTER 3 – REQUIREMENT ANALYSIS

3.1 PROJECT REQUIREMENTS


This project has both hardware and software requirements, with priority given to the
software, since the whole project does not depend on any specific hardware other
than the necessary basics.

3.1.1 HARDWARE REQUIREMENTS


The basic hardware requirements are:
 CPU PROCESSOR – i5 (8th gen or above is preferable)
 RAM – 8 GB
 WIRELESS ADAPTER – for connecting to the internet to deploy the app and to
download and install the required libraries
 GPU – not needed in some cases, but to boost performance at least Intel Iris
graphics should be used.

3.1.2 SOFTWARE REQUIREMENTS


Let us break down our software requirements as follows:
 BASIC REQUIREMENTS
 CODE EDITORS
 LIBRARIES USED
 SOFTWARE USED TO DEPLOY

BASIC REQUIREMENTS:
 Windows 10 or above
 Any supporting browser
 Python

CODE EDITORS:
 JUPYTER NOTEBOOK:
Jupyter Notebook allows users to compile all aspects of a data
project in one place making it easier to show the entire process of a
project to your intended audience. Through the web-based
application, users can create data visualizations and other components
of a project to share with others via the platform. Here we used it to
do data cleaning, EDA, and model building.
 VISUAL STUDIO CODE:
Visual Studio Code is a streamlined code editor with support for
development operations like debugging, task running, and version
control. It aims to provide just the tools a developer needs for a quick
code-build-debug cycle and leaves more complex workflows to fuller
featured IDEs, such as Visual Studio IDE.

LIBRARIES USED
 PANDAS:
Pandas is an open-source Python package that is most widely used for
data science/data analysis and machine learning tasks. It is built on
top of another package named NumPy, which provides support for
multi-dimensional arrays.
 NUMPY:
NumPy can be used to perform a wide variety of mathematical
operations on arrays. It adds powerful data structures to Python that
guarantee efficient calculations with arrays and matrices, and it
supplies an enormous library of high-level mathematical functions that
operate on these arrays and matrices.
 SEABORN:
Seaborn is a library that uses Matplotlib underneath to plot graphs. It
will be used to visualize random distributions.

 MATPLOTLIB:
Matplotlib is one of the plotting libraries in Python and is widely used
in machine learning applications, together with its numerical
mathematics extension NumPy, to create static, animated, and
interactive visualizations.
 SCIKIT-LEARN:
scikit-learn is an open-source Python library that implements a
range of machine learning, pre-processing, cross-validation, and
visualization algorithms using a unified interface. Important features of
scikit-learn: Simple and efficient tools for data mining and data
analysis.
 STREAMLIT:
Streamlit can seamlessly integrate with other popular python libraries
used in Data science such as NumPy, Pandas, Matplotlib, Scikit-
learn and many more. Note: Streamlit uses React as a frontend
framework to render the data on the screen.

SOFTWARE USED FOR DEPLOYMENT:


 HEROKU:
Heroku is an ecosystem of cloud services, which can be used to
instantly extend applications with fully managed services. Using an
existing, high-quality service is something that empowers developers -
they can build more, faster, by using trusted services that provide the
functionality that they require.

CHAPTER 4 – SYSTEM ARCHITECTURE
4.1 SYSTEM METHODOLOGY

4.2 USE CASE DIAGRAM

4.3 SEQUENCE MODEL

4.4 ACTIVITY DIAGRAM

4.5 COLLABORATION DIAGRAM

4.6 SYSTEM OVERVIEW

CHAPTER 5 – METHODOLOGY

5.1 LOADING THE DATA


After importing our basic dependencies, we loaded the dataset using the pandas
library. There are a few steps to getting the dataset and loading it into the data frame
(a minimal sketch follows these steps).
 STEP 1: Downloading the dataset from Kaggle.
 STEP 2: Next installing the pandas library.
 STEP 3: Importing the pandas library.
 STEP 4: Using pandas we will create a Data Frame.
 STEP 5: After creating the Data Frame, we will load the dataset, which is in the
form of a .CSV file (comma-separated values), into the Data Frame.
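The sketch below illustrates steps 2-5; the file name laptop_data.csv is an assumption, since the exact name of the Kaggle file is not given in this report.

```python
# A minimal sketch of installing pandas, importing it, and loading the Kaggle
# CSV into a DataFrame. The file name is an assumption.
# pip install pandas
import pandas as pd

df = pd.read_csv("laptop_data.csv")   # the .csv file downloaded from Kaggle
print(df.shape)    # number of rows and columns
print(df.head())   # first few rows of the data frame
```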

5.2 CLEANING THE DATA
The dataset we loaded is still noisy and raw, so we removed the unnecessary
columns and changed the object types to suit our usage. We also did some string
manipulation, as sketched below.
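A hedged sketch of this kind of cleaning follows; the column names (Ram, Weight) and the toy rows are assumptions for illustration, not the actual dataset.

```python
# A minimal sketch of the cleaning steps: drop bad rows and convert text
# columns such as "8GB" or "2.1kg" into numeric types.
import pandas as pd

# Toy rows standing in for the raw dataset; column names are assumptions.
df = pd.DataFrame({"Ram": ["8GB", "16GB", "8GB", None],
                   "Weight": ["2.1kg", "1.37kg", "2.1kg", "1.5kg"]})

print(df.isnull().sum())                 # count missing values per column
df = df.dropna().drop_duplicates()       # drop incomplete and duplicate rows

# String manipulation: strip the units and change the object dtypes
df["Ram"] = df["Ram"].str.replace("GB", "", regex=False).astype(int)
df["Weight"] = df["Weight"].str.replace("kg", "", regex=False).astype(float)
print(df.dtypes)
```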

5.3 EXPLORATORY DATA ANALYSIS


We followed a sequence of steps to analyse every single feature of our dataset.

 PRICE
1. Viewing the distribution of the price column
2. Creating plots for categorical variables
3. Plots of the average price for each laptop brand, which give us the
insight that the price of a laptop varies by company
4. Plots for the various types of laptops
5. Laptop type and its variation with price
6. Variation of screen size (inches) with price

 SCREEN RESOLUTION
The Screen Resolution column mixes several pieces of information; Touch
Screen, Normal, and IPS Panel are the three parts on the basis of which we
can segregate the values.
1. STEP 1: Creating a new column, Touchscreen; if the value is 1, that
laptop has a touch screen
2. STEP 2: Comparing Touch Screen with the price of the laptop
3. STEP 3: Creating a new column named IPS, indicating whether the
laptop has an IPS panel or not
4. STEP 4: Price variation with respect to the IPS column
5. STEP 5: Splitting the text at the "x" character and separating the two parts, where
one of the columns is the Y resolution; we still need to do some feature
engineering on the X resolution column

6. STEP 6: From the whole text of the X_res column, we need to extract
the digits. The problem is that the numbers are scattered in some cases,
which is why we use a regular expression; with it we get exactly the
numbers we are looking for. First we replace every "," with "", and then
we find all the numbers in the string with a pattern such as "\d+\.?\d+",
where \d+ matches a run of digits and \.? optionally matches a decimal
point in between.
7. STEP 7: Creating a heatmap for the correlation plot.
8. STEP 8: The correlation plot shows that as X_res and Y_res increase, the
price of the laptop also increases, so X_res and Y_res are positively
correlated with price and carry a lot of information; that is why we split
the Resolution column into the X_res and Y_res columns.

9. STEP 9: To make things better, we can create a new column named
PPI (pixels per inch). As the correlation plot shows, X_res and
Y_res have high collinearity, so we combine them with
Inches, which has less collinearity, as per the formula shown below.
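Since the original formula appears only as an image in the report, the hedged sketch below shows both the resolution extraction (steps 5-6) and the assumed standard PPI formula, ppi = sqrt(X_res^2 + Y_res^2) / Inches. The column names follow the text, but the exact regular expression and sample strings are assumptions.

```python
# A minimal sketch of extracting X_res/Y_res and computing PPI; column names
# follow the text, while the regex and sample values are assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({"ScreenResolution": ["IPS Panel Full HD 1920x1080",
                                        "Full HD 1,920x1,080"],
                   "Inches": [15.6, 14.0]})

# Split at the 'x' character: the right-hand part is the Y resolution
splits = df["ScreenResolution"].str.split("x", n=1, expand=True)
df["Y_res"] = splits[1].str.replace(",", "").str.findall(r"\d+").str[0].astype(int)

# For the X resolution, remove commas and pull the trailing digits with a regex
df["X_res"] = (splits[0].str.replace(",", "")
                        .str.findall(r"\d+").str[-1].astype(int))

# PPI (pixels per inch): diagonal resolution divided by the screen size
df["ppi"] = np.sqrt(df["X_res"] ** 2 + df["Y_res"] ** 2) / df["Inches"]
print(df[["X_res", "Y_res", "ppi"]])
```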

 CPU
1. STEP 1: The most common processors are made by Intel, so we will
be clustering the processors into categories such as i3, i5, i7, and Other.
Here "Other" covers the Intel processors which do not have i3, i5, or i7
attached to them, since they are a different group, and that is why we
clustered them as Other Intel; AMD processors form a separate
category of their own.
2. STEP 2: We need to extract the first 3 words of the CPU column, as
the first 3 words of every row under the CPU column is the type of the
CPU, which we are going to use.
3. STEP 3: If we get any of the Intel i3, i5, or i7 versions, we will return
them as they are; if we get any other processor, we first check whether
it is an Intel variant or not. If yes, we tag it as "Other Intel
Processor"; otherwise we tag it as an "AMD Processor" (a sketch of this
rule is given after these steps).
4. STEP 4: Price vs Processor variation
5. STEP 5: Dropping the CPU column
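A minimal sketch of this clustering rule follows; the column name "Cpu" and the helper name fetch_processor are assumptions for illustration.

```python
# A minimal sketch of the processor-clustering rule described in steps 2-3.
# The column name "Cpu" and the helper name are assumptions.
import pandas as pd

def fetch_processor(cpu_name: str) -> str:
    """Keep Intel i3/i5/i7 as-is; group the rest as Other Intel or AMD."""
    first_three = " ".join(cpu_name.split()[:3])
    if first_three in ("Intel Core i3", "Intel Core i5", "Intel Core i7"):
        return first_three
    return "Other Intel Processor" if cpu_name.split()[0] == "Intel" else "AMD Processor"

df = pd.DataFrame({"Cpu": ["Intel Core i5 7200U",
                           "Intel Celeron Dual Core N3350",
                           "AMD Ryzen 5 3500U"]})
df["Cpu brand"] = df["Cpu"].apply(fetch_processor)
print(df)
```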

 RAM
We will separate the type of memory and its value, just as in the previous part.
This has to be done in several steps: here the memory is not given as a single
value but in different components, such as 128GB SSD + 1TB HDD, so to bring
everything into the same dimension we need to make some modifications (a
sketch is given after these steps).

1. STEP 1: Relating ram with Price column


2. STEP 2: Counting the different categories and different kinds of
variations.
3. STEP 3: From the observed variants, we will remove the decimal
part, for example 1.0 TB becomes 1 TB
4. STEP 4: Then replace the GB word with “”

5. STEP 5: replace the TB word with “000”
6. STEP 6: Splitting the word across the “+” character
7. STEP 7: Stripping all the white spaces, basically eliminating white
space
8. STEP 8: Removing all the characters but keeping the numbers
9. STEP 9: Multiplying the elements and storing the result in subsequent
columns
10. STEP 10: Dropping unnecessary columns
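A hedged sketch of these steps is given below, assuming the storage column is called Memory and holds strings such as "128GB SSD + 1TB HDD"; the exact column names and helper function are assumptions.

```python
# A minimal sketch of steps 3-10 for the storage column ("Memory" is an assumption).
import pandas as pd

df = pd.DataFrame({"Memory": ["128GB SSD + 1TB HDD", "256GB SSD", "1.0TB HDD"]})

mem = (df["Memory"].str.replace(r"\.0", "", regex=True)    # 1.0TB -> 1TB
                   .str.replace("GB", "", regex=False)
                   .str.replace("TB", "000", regex=False)) # 1TB -> 1000 (GB)

parts = mem.str.split("+", n=1, expand=True)               # split across "+"
first, second = parts[0].str.strip(), parts[1].fillna("").str.strip()

def gb(part: str, kind: str) -> int:
    """Capacity in GB if this part is of the given kind (HDD/SSD), else 0."""
    if kind not in part:
        return 0
    return int("".join(ch for ch in part if ch.isdigit()))

df["HDD"] = [gb(a, "HDD") + gb(b, "HDD") for a, b in zip(first, second)]
df["SSD"] = [gb(a, "SSD") + gb(b, "SSD") for a, b in zip(first, second)]
df = df.drop(columns=["Memory"])                            # drop the raw column
print(df)
```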

 GPU
Here we have relatively little data about the laptops, so it is better to focus on the
GPU brands instead of the detailed values listed beside them; we will focus on
the brands.
1. STEP 1: Counting the values in GPU column
2. STEP 2: Extracting the brands
3. STEP 3: Removing the “ARM” tuple
4. STEP 4: Used median to check if there is any impact of outlier or not

So, after analysing everything, if we apply a logarithm to the price column, we obtain an
approximately Gaussian distribution.

5.4 MODEL BUILDING


In model building, the first thing we do is split the dataset into a training set and a test
set. We imported a class named ColumnTransformer, which we use
widely while building our models with Pipelines; for this we must get the index
numbers of the columns that contain categorical variables. Before that, we
create a hash map that holds all the distinct feature values in JSON format. We will use several
algorithms to check which gives the best accuracy:

 LINEAR REGRESSION
 RIDGE REGRESSION

 LASSO REGRESSION
 DECISION TREES
 RANDOM FOREST

In all these algorithms, we apply OneHotEncoding to the categorical columns by
index and keep the remainder as passthrough, so that no column other than the
ones undergoing the transformation is affected (a sketch of this pipeline is given below).
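The sketch below illustrates such a pipeline on toy data; the column names, categorical column indices, and hyperparameter values are assumptions for illustration, not the project's final settings.

```python
# A minimal sketch of a ColumnTransformer + RandomForest pipeline trained on
# the log of the price. Column names and settings are assumptions.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Toy data standing in for the cleaned laptop DataFrame
df = pd.DataFrame({
    "Company": ["Dell", "HP", "Apple", "Dell", "Asus", "HP"] * 20,
    "TypeName": ["Notebook", "Ultrabook", "Ultrabook", "Gaming", "Notebook", "Notebook"] * 20,
    "Ram": [8, 16, 8, 16, 4, 8] * 20,
    "Weight": [2.1, 1.3, 1.4, 2.5, 2.2, 1.9] * 20,
    "Price": [45000, 90000, 110000, 95000, 30000, 50000] * 20,
})
X = df.drop(columns=["Price"])
y = np.log(df["Price"])                       # train on log-price, as in the report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=2)

# One-hot encode the categorical columns (here indices 0 and 1), pass the rest through
step1 = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), [0, 1])],
    remainder="passthrough")
step2 = RandomForestRegressor(n_estimators=100, max_features=0.75,
                              ccp_alpha=0.0, random_state=3)

pipe = Pipeline([("step1", step1), ("step2", step2)])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print("R2 score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("sample predicted price:", np.exp(y_pred[0]))   # back from log scale
```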

After finding that Random Forest has the best accuracy, we perform hyperparameter
tuning on the Random Forest. If we do not specify the max depth, the tree
structure becomes overly complex for the sampling. So we used the ccp_alpha
hyperparameter, which applies cost-complexity pruning and provides an option
to control the size of a tree; greater values of ccp_alpha increase the number of nodes
pruned.

Next, we predict on the whole dataset. We transform the logarithmic form of the
price back to its exponential form to get the final result. We created two versions of the
model; we deliberately left some imprecision in the first so the model would not fully
converge, since convergence leads to saturation, where the model no longer learns
from new data and simply gives out the result it had when it converged. We then used
the predicted set as new data and increased the accuracy, without affecting the logic
we used for hyperparameter tuning. By doing this, we raised the accuracy from
88.5% to 88.7%.

5.5 WEB APP DEVELOPMENT
First, we download and install the Streamlit library. Then we unpickle
the files we pickled earlier and use that data in the app development.
We then create input boxes for each laptop feature so the
user can put together their own configuration (a minimal sketch is given below).
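A minimal sketch of such a Streamlit app (app.py) follows; the pickle file names, widget list, and column order are assumptions and must match whatever columns the pipeline was actually trained on.

```python
# A minimal sketch of the Streamlit UI; file names, widgets, and column
# names are assumptions based on the description above.
import pickle
import numpy as np
import pandas as pd
import streamlit as st

pipe = pickle.load(open("pipe.pkl", "rb"))   # fitted pipeline pickled earlier
df = pickle.load(open("df.pkl", "rb"))       # dataset used to populate the widgets

st.title("Laptop Price Predictor")

company = st.selectbox("Brand", df["Company"].unique())
laptop_type = st.selectbox("Type", df["TypeName"].unique())
ram = st.selectbox("RAM (GB)", [4, 8, 16, 32])
weight = st.number_input("Weight of the laptop (kg)")

if st.button("Predict Price"):
    query = pd.DataFrame([[company, laptop_type, ram, weight]],
                         columns=["Company", "TypeName", "Ram", "Weight"])
    # the model was trained on log(price), so transform the prediction back
    st.title(f"The predicted price is around {int(np.exp(pipe.predict(query)[0]))}")
```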

5.6 DEPLOYMENT
Once the web app code was ready, we pushed it to the Git repository. We
deployed our web application on the Heroku platform and maintain it through the Git
environment.

CHAPTER 6 – IMPLEMENTATION
1. Importing Basic Dependencies

2. Loading data into Data Frame

3. Cleaning the data

4. Categorical and Numerical variables

5. Exploratory Data Analysis

Extracting X resolution and Y resolution for Screen Resolution feature

CPU analysis

RAM analysis

GPU analysis

Operating System Analysis

Before applying logarithm to the Price feature

After applying logarithm to the Price feature

6. Model Building
Importing Basic dependencies and splitting the dataset

Creating hash map of all distinct features

Linear Regression

Ridge Regression

Lasso Regression

Decision Trees

Random Forest

Hyperparameter Tuning

Prediction with version 1

Prediction with version 2

Pickling on our pipeline

Pickling on our predicted data

7. Web Application Development

8. GIT REPOSITORY

Deployment

CHAPTER 7 – RESULT
After the deployment of our web application, we can go to the website and enter our
custom configuration to see the predicted price.

The output of our model gives a smoother prediction line than the existing ones.

CHAPTER 8 – CONCLUSION
Predicting prices through the application of machine learning with the Random
Forest algorithm makes it easy for buyers, especially students, to determine which
laptop specifications are most desirable for meeting their needs and are in
accordance with their purchasing power. Students no longer need to search through
various sources to find the laptop specifications they need, because the
machine learning application already provides the most desirable specifications
along with their laptop prices.

REFERENCES
[1]. Sorower MS. A literature survey on algorithms for multi-label learning. Oregon
State University, Corvallis. 2010 Dec;18.
[2]. Pandey M, Sharma VK. A decision tree algorithm pertaining to the student
performance analysis and prediction. International Journal of Computer Applications.
2013 Jan 1;61(13).
[3]. Priyama A, Abhijeeta RG, Ratheeb A, Srivastavab S. Comparative analysis of
decision tree classification algorithms. International Journal of Current Engineering
and Technology. 2013 Jun;3(2):334-7
[4]. Streamlit.io, Kaggle.com, Wikipedia.com
[5]. Ho, T. K. (1995, August). Random decision forests. In Document analysis and
recognition, 1995., proceedings of the third international conference on (Vol. 1, pp.
278-282).
[6]. Weka 3 - Data Mining with Open Source Machine Learning Software in Java.
(n.d.), Retrieved from: https://www.cs.waikato.ac.nz/ml/weka/. [August 04, 2018].
[7]. Noor, K., & Jan, S. (2017). Vehicle Price Prediction System using Machine
Learning Techniques. International Journal of Computer Applications, 167(9), 27-31.
[8]. Pudaruth, S. (2014). Predicting the price of used cars using machine learning
techniques. Int. J. Inf. Comput. Technol, 4(7), 753-764.
[9]. Listiani, M. (2009). Support vector regression analysis for price prediction in a
car leasing application (Doctoral dissertation, Master thesis, TU Hamburg-Harburg).
[10]. Agencija za statistiku BiH. (n.d.), retrieved from: http://www.bhas.ba. [accessed
July 18, 2018.]
[11]. Utku A, Hacer (Uke) Karacan, Yildiz O, Akcayol MA. Implementation of a New
Recommendation System Based on Decision Tree Using Implicit Relevance
Feedback. JSW. 2015 Dec.

