Report - Mini Project
PRICE PREDICTION ON LAPTOPS USING MACHINE LEARNING
Project Report
Submitted By
KEERTHIVASAN.K
20CS0910
ACHARIYA ARTS AND SCIENCE COLLEGE
(Recognized under sec-2f of the UGC Act 1956)
(Affiliated to Pondicherry University)
VILLIANUR, PUDUCHERRY – 605 110
ACKNOWLEDGEMENT
It is my esteemed privilege to express my gratitude to the Almighty and my respect to
all those who have guided and inspired me during the course of this project. First and
foremost, I express my sincere gratitude to our Chief Mentor, Mr. J. ARAWINDHAN.
I take immense pleasure in conveying my sincere and heartfelt gratitude to
Mr. R. MURUGADOSS M.C.A., Head of the Department of Computer Science,
Achariya Arts and Science College, Puducherry, for giving permission to do this project
and for providing valuable information that helped me enrich and complete the task. I
am greatly indebted to my guide, Mrs. M. YASOTHAPRIYA, Assistant Professor,
Department of Computer Science, for her encouragement and valuable suggestions
throughout the study phase, implementation phase and preparation of the project.
I would be failing in my duty if I did not express my gratitude to my parents,
whose smiling faces and benevolence made this endeavour a pleasure to complete.
Finally, with love and affection, I thank all who have rendered their support for the
successful completion of this project report.
DECLARATION
We hereby declare that the Project Report entitled "PRICE PREDICTION ON
LAPTOPS USING MACHINE LEARNING", submitted to Achariya Arts and
Science College, affiliated to Pondicherry University, in partial fulfilment of the
requirements for the award of the degree of "BACHELOR OF COMPUTER
SCIENCE", is a record of project work done by us during the period
September-November 2022 under the supervision and guidance of
Mrs. M. YASOTHAPRIYA, Asst. Professor, Department of Computer Science.
ABSTRACT
This paper presents a laptop price prediction system that uses a supervised machine
learning technique. The research uses a random decision forest as the prediction
method, which achieved a prediction accuracy of 88.7%. A random decision forest
takes multiple independent variables but exactly one dependent variable, whose
actual and predicted values are compared to measure the accuracy of the results.
In the proposed system, price is the dependent variable to be predicted, and it is
derived from factors such as the laptop's model, RAM, storage (HDD/SSD),
GPU, CPU, IPS display, and touch screen.
TABLE OF CONTENTS
1. INTRODUCTION
1.1 BACKGROUND INTRODUCTION
1.2 ABOUT DATASET
1.3 RELATED TOPICS TO THIS PROJECT
1.3.1 MACHINE LEARNING
2.4 UI INTEGRATION AND DEPLOYMENT
2.5 ADVANTAGES OF LPP APP
2.6 LIMITATIONS OF LPP APP
3. REQUIREMENT ANALYSIS
3.1 PROJECT REQUIREMENTS
3.1.1 HARDWARE REQUIREMENTS
4. SYSTEM ARCHITECTURE
4.1 SYSTEM METHODOLOGY
4.2 USE CASE DIAGRAM
4.3 SEQUENCE MODEL
4.4 ACTIVITY DIAGRAM
4.5 COLLABORATION DIAGRAM
4.6 SYSTEM OVERFLOW
5. METHODOLOGY
5.1 LOADING THE DATA
5.2 CLEANING THE DATA
5.3 EXPLORATORY DATA ANALYSIS
5.4 MODEL BUILDING
5.5 WEB DEVELOPMENT
5.6 DEPLOYMENT
6. IMPLEMENTATION
7. RESULT
8. CONCLUSION
REFERENCES
CHAPTER 1. INTRODUCTION
1.1 BACKGROUND INTRODUCTION
Laptop price prediction, especially when the laptop comes directly from the factory
to electronics markets and stores, is a critical task. The mad rush for laptops to
support remote work and learning that we saw in 2020 has eased, but demand remains
high. In India, demand for laptops soared after the nationwide lockdown, leading to
shipments of 4.1 million units in the June quarter of 2021, the highest in five years.
Even now in 2022, most companies are adapting to remote work environments, which
keeps the demand for laptops high. Accurate laptop price prediction requires expert
knowledge, because price usually depends on many distinctive features and factors.
Typically, the most significant ones are brand and model, RAM, storage, GPU, CPU,
etc. In this project, we applied different methods and techniques to achieve higher
accuracy in laptop price prediction.
Figure 1
1.2 ABOUT DATASET
We obtained the dataset from Kaggle. Most of its columns are noisy and pack several
pieces of information into a single field, but with feature engineering we can get
better results. The main limitation is that the dataset is small. A larger dataset
would be preferable, but we can still obtain good accuracy with it. We have
developed a website that predicts a tentative price for a laptop based on the
configuration chosen by the user.
Figure 2
Figure 3
With the help of sample historical data, known as training data, machine
learning algorithms build a mathematical model that makes predictions or
decisions without being explicitly programmed. Machine learning brings computer
science and statistics together to create predictive models, and it constructs or
uses algorithms that learn from historical data. In general, the more relevant data
we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining
more data.
A machine learning system learns from historical data, builds prediction
models, and predicts the output whenever it receives new data. The
accuracy of the predicted output depends on the amount of data, because a large
amount of data helps to build a better model that predicts the output more accurately.
The need for machine learning is increasing day by day because it can do tasks that
are too complex for a person to implement directly. As humans, we cannot process
huge amounts of data manually, so we need computer systems, and machine learning
makes this easier for us.
We can train machine learning algorithms by providing them with a large amount of
data and letting them explore the data, construct models, and predict the required
output automatically. The performance of a machine learning algorithm depends on the
amount of data and can be measured by a cost function. With the help of
machine learning, we can save both time and money.
Machine learning is broadly classified into three types:
1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning
1.3.2 FEATURE ENGINEERING
Feature engineering refers to the manipulation (addition, deletion, combination,
mutation) of the data set to improve machine learning model training,
leading to better performance and greater accuracy. Effective feature engineering
is based on sound knowledge of the business problem and the available data sources.
Creating new features gives you a deeper understanding of your data and results in
more valuable insights. When done correctly, feature engineering is one of the most
valuable techniques of data science, but it is also one of the most challenging. A
common example of feature engineering is when your doctor uses your body mass
index (BMI). BMI is calculated from both body weight and height and serves as a
surrogate for a characteristic that is very hard to accurately measure: the proportion of
lean body mass.
Some common types of feature engineering include:
Scaling and normalization mean adjusting the range and centre of data to
ease learning and improve the interpretation of the results.
Filling missing values implies filling in null values based on expert
knowledge, heuristics, or by some machine learning techniques. Real-world
datasets can be missing values due to the difficulty of collecting complete
datasets and because of errors in the data collection process.
Feature selection means removing features because they are unimportant,
redundant, or outright counterproductive to learning. Sometimes you simply
have too many features and need fewer.
Feature coding involves choosing a set of symbolic values to represent
different categories. Concepts can be captured with a single column that
comprises multiple values, or they can be captured with multiple columns,
each of which represents a single value and has a true or false in each field.
For example, feature coding can indicate whether a particular row of data was
collected on a holiday. This is a form of feature construction.
Feature construction creates a new feature(s) from one or more other
features. For example, using the date you can add a feature that indicates the
day of the week. With this added insight, the algorithm could discover that
certain outcomes are more likely on a Monday or a weekend.
Feature extraction means moving from low-level features that are unsuitable
for learning — practically speaking, you get poor testing results — to higher-
level features that are useful for learning. Often feature extraction is valuable
when you have specific data formats — like images or text — that must be
converted to a tabular row-column, example-feature format.
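Two of the techniques above, feature construction and scaling, can be sketched in a few lines of plain Python. The sample dates and RAM sizes are made up for illustration and are not taken from the project's dataset, and the helper names are hypothetical.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def day_of_week(d: date) -> str:
    """Feature construction: derive the weekday name from a date."""
    return WEEKDAYS[d.weekday()]  # date.weekday() returns 0 for Monday

def min_max_scale(values):
    """Scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero for a constant column
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sample values
sale_dates = [date(2022, 9, 5), date(2022, 9, 10)]
weekdays = [day_of_week(d) for d in sale_dates]

ram_gb = [4, 8, 16]
scaled_ram = min_max_scale(ram_gb)
```

The new weekday column could then be encoded as a categorical feature, while the scaled column keeps the ordering of the original values in a fixed range.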
1.3.3 DATA PRE-PROCESSING
Data pre-processing is the process of preparing raw data and making it suitable for a
machine learning model. It is the first and a crucial step in creating a machine
learning model. In a machine learning project, we do not always come across clean
and formatted data, and before doing any operation on the data it must be cleaned
and put into a consistent format. Data pre-processing is the task that does this.
1.3.5 RIDGE REGRESSION
Ridge regression is a model tuning method used to analyse data that suffers
from multicollinearity. This method performs L2 regularization. When
multicollinearity occurs, the least-squares estimates are unbiased but their variances
are large, which results in predicted values being far away from the actual values.
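For reference, the standard ridge objective adds an L2 penalty to the least-squares loss. The notation is introduced here for illustration and does not come from the report: y_i is the observed price, x_i the feature vector, \beta the coefficient vector, and \lambda \ge 0 the penalty strength.

```latex
\hat{\beta}^{\mathrm{ridge}}
  = \arg\min_{\beta} \;
    \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2
    + \lambda \sum_{j=1}^{p} \beta_j^2
```

A larger \lambda shrinks the coefficients towards zero, which reduces their variance at the cost of some bias.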
1.3.6 LASSO REGRESSION
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is
where data values are shrunk towards a central point, like the mean. The lasso
procedure encourages simple, sparse models (i.e., models with fewer parameters).
This regression is well-suited for models showing high levels of multicollinearity or
when you want to automate certain parts of model selection, like variable
selection/parameter elimination.
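In the same illustrative notation (y_i for the observed price, x_i for the feature vector, \beta for the coefficients, \lambda \ge 0 for the penalty strength, none of which are defined in the report), the standard lasso objective replaces the squared penalty with an absolute-value (L1) penalty:

```latex
\hat{\beta}^{\mathrm{lasso}}
  = \arg\min_{\beta} \;
    \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Because the L1 penalty can set some coefficients exactly to zero, the fitted model uses fewer features, which is the variable selection behaviour described above.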
1.3.7 DECISION TREES
A decision tree learns from data by building a set of if-then-else decision rules. For
instance, it can approximate a sine curve with such rules. The deeper the tree, the
more complex the decision rules and the more closely the model fits the training data.
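The effect of depth can be illustrated with a minimal one-dimensional regression tree written in plain Python. This is a sketch for illustration only, not the scikit-learn implementation used in the project, and the function names are hypothetical. It fits the sine curve at two depths and compares the training error.

```python
import math

def fit_tree(xs, ys, depth):
    """Fit a 1-D regression tree on sorted xs.

    Returns a leaf value (float) or a (threshold, left, right) tuple.
    """
    mean = sum(ys) / len(ys)
    if depth == 0 or len(xs) < 2:
        return mean
    best_sse, best_i = None, None
    for i in range(1, len(xs)):  # try every split between neighbours
        left, right = ys[:i], ys[i:]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    threshold = (xs[best_i - 1] + xs[best_i]) / 2
    return (threshold,
            fit_tree(xs[:best_i], ys[:best_i], depth - 1),
            fit_tree(xs[best_i:], ys[best_i:], depth - 1))

def predict(tree, x):
    """Follow the if-then-else rules down to a leaf."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def mse(tree, xs, ys):
    return sum((predict(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i * 0.1 for i in range(63)]  # roughly one period of the sine curve
ys = [math.sin(x) for x in xs]

shallow = fit_tree(xs, ys, depth=2)
deep = fit_tree(xs, ys, depth=5)
shallow_error = mse(shallow, xs, ys)
deep_error = mse(deep, xs, ys)
```

The deeper tree has more leaves and therefore a lower training error, but on noisy data the same flexibility can lead to overfitting.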
1.3.8 RANDOM FOREST
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of
combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
A greater number of trees in the forest generally leads to higher accuracy and
reduces the risk of overfitting.
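For regression, the forest's prediction is usually the average of the individual tree predictions. In the notation below, which is introduced here for illustration, B is the number of trees and T_b(x) is the prediction of the b-th tree for input x.

```latex
\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)
```

Averaging over many trees, each trained on a different bootstrap sample, smooths out the errors of any single tree.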
The below diagram explains the working of the Random Forest algorithm:
1.5 RELATED WORKS
The prediction of laptop prices has been studied extensively in previous research.
Listiani [9] discussed, in her Master's thesis, that a regression model built using
Decision Tree and Random Forest regressors can predict the price of a leased laptop
with better precision than multivariate regression or simple multiple regression.
This is because the Decision Tree algorithm is better at dealing with datasets with
more dimensions and is less prone to overfitting and underfitting. The weakness of
this research is that the improvement of the more advanced Decision Tree regression
over simple regression was not shown in basic indicators like mean, variance, or
standard deviation.
1.7.2 CHAPTER 2 – LITERATURE REVIEW
This chapter focuses on the methods used to clean and transform
our data (Data Pre-processing and Feature Engineering), the ML techniques used
in model building (Linear Regression, Ridge Regression, Lasso Regression, Decision
Trees, Random Forest), and the platform (Heroku) used to deploy our model in the form
of a web application (built with Streamlit). We checked the MAE score and R2 score
to see which algorithm is best for predicting the prices.
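The two scores can be computed as follows. The functions below are a plain-Python sketch that mirrors the names of the scikit-learn metrics but does not call the library, and the price values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted prices
actual = [40000, 55000, 70000]
predicted = [42000, 53000, 71000]

mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
```

A lower MAE and an R2 score closer to 1 both indicate that the predictions follow the actual prices more closely.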
1.7.7 CHAPTER 7 - RESULT
This chapter focuses on the output of our web application. We provide
screenshots of sample output, along with graphs of the prediction output of two
versions of the random forest regressor. One version uses untuned hyperparameters
and the other uses tuned hyperparameters. Using the two graphs, we compare how
smoothly the predicted values follow the actual values.
CHAPTER 2 - LITERATURE REVIEW
2.1 LOADING THE DATA
We got our dataset from Kaggle. Kaggle allows users to find and publish data sets,
explore and build models in a web-based data-science environment, work with other
data scientists and machine learning engineers, and enter competitions to solve data
science challenges. As ML and Data Science aspirants, we regularly turn to
Kaggle, and that is where we found this dataset.
2.3 MODEL BUILDING
To train our model, we took advantage of the scikit-learn library. We used
various ML algorithms:
Linear models
1. Linear Regression
2. Ridge Regression
3. Lasso Regression
Tree-based models
1. Decision Trees
2. Random Forest
We compared the accuracy of these models and selected the Random Forest
model, which gave around 88.5% accuracy. We did not stop there: we performed
hyperparameter tuning to set the right parameters for the algorithm, which raised the
accuracy from 88.5% to 88.7%. This may not look like much of an increase, but
when predicting over the whole dataset, the graph of actual versus predicted prices
shows that the tuned model fits more of the samples, so the improvement is more
meaningful than the numbers suggest.
After writing the code for our web application, we pushed the necessary files to the
Git repository. Then we used Heroku to deploy our web application. Heroku is a
container-based cloud Platform as a Service (PaaS). Developers use Heroku to
deploy, manage, and scale modern apps. This platform is elegant, flexible, and easy
to use, offering developers the simplest path to getting their apps to market.
2.6 LIMITATIONS OF LPP APP
Our app does not cover every single feature of a laptop, so it is not suitable for
someone who wants a price for a configuration with more detailed internal
specifications.
Our app is not 100% accurate, because conditions such as discounts
and offers given by the vendor may change the price range.
It is a web application rather than a portable Android application, so it is less
convenient for customers who prefer not to use a browser.
CHAPTER 3 – REQUIREMENT ANALYSIS
BASIC REQUIREMENTS:
Windows 10 or above
Any supporting browser
Python
CODE EDITORS:
JUPYTER NOTEBOOK:
Jupyter Notebook allows users to compile all aspects of a data
project in one place making it easier to show the entire process of a
project to your intended audience. Through the web-based
application, users can create data visualizations and other components
of a project to share with others via the platform. Here we used this to
do data cleaning, EDA, and model building.
VISUAL STUDIO CODE:
Visual Studio Code is a streamlined code editor with support for
development operations like debugging, task running, and version
control. It aims to provide just the tools a developer needs for a quick
code-build-debug cycle and leaves more complex workflows to fuller
featured IDEs, such as Visual Studio IDE.
LIBRARIES USED
PANDAS:
Pandas is an open-source Python package that is most widely used for
data science/data analysis and machine learning tasks. It is built on
top of another package named NumPy, which provides support for
multi-dimensional arrays.
NUMPY:
NumPy can be used to perform a wide variety of mathematical
operations on arrays. It adds powerful data structures to Python that
guarantee efficient calculations with arrays and matrices, and it
supplies an enormous library of high-level mathematical functions that
operate on these arrays and matrices.
SEABORN:
Seaborn is a library that uses Matplotlib underneath to plot graphs. It
will be used to visualize random distributions.
MATPLOTLIB:
Matplotlib is one of the plotting libraries in Python and is widely
used in machine learning applications, together with its numerical
mathematics extension NumPy, to create static, animated, and
interactive visualizations.
SCIKIT-LEARN:
scikit-learn is an open-source Python library that implements a
range of machine learning, pre-processing, cross-validation, and
visualization algorithms using a unified interface. Important features of
scikit-learn: Simple and efficient tools for data mining and data
analysis.
STREAMLIT:
Streamlit can seamlessly integrate with other popular python libraries
used in Data science such as NumPy, Pandas, Matplotlib, Scikit-
learn and many more. Note: Streamlit uses React as a frontend
framework to render the data on the screen.
CHAPTER 4 – SYSTEM ARCHITECTURE
4.1 SYSTEM METHODOLOGY
4.3 SEQUENCE MODEL
4.5 COLLABORATION DIAGRAM
CHAPTER 5 – METHODOLOGY
5.2 CLEANING THE DATA
The dataset we loaded is still noisy and raw, so we removed the unnecessary
columns and changed the object types to suit our usage. We also did some string
manipulation.
PRICE
1. Viewing the distribution of the price column
2. Creating plots for categorical variables
3. Plots for average price for each of the laptop brands, which give us
the insight that the price of a laptop varies by company
4. Plots for various types of laptops
5. Laptop type and variation about price
6. Variation of inches towards the price
SCREEN RESOLUTION
For the Screen Resolution column, the values combine several pieces of
information: whether the laptop has a Touch Screen, whether it has an IPS Panel,
and the resolution itself, so we segregate the column on that basis
1. STEP 1: Creating a new column, Touchscreen, where a value of 1 means that the
laptop has a touch screen
2. STEP 2: Comparing Touchscreen with the price of the laptop
3. STEP 3: Creating a new column named IPS, which indicates whether the laptop
has an IPS panel or not
4. STEP 4: Price variation with respect to the IPS column
5. STEP 5: Splitting the text at the “x” character and separating the 2 parts, where
one of the columns is the Y resolution and we need to do some feature
engineering on the X resolution column
6. STEP 6: From the whole text of the X_res column, we
need to extract the digits, but the numbers are
scattered in some cases, which is why we use regex to
get exactly the numbers we are looking for.
We first replace all the “,” with “” and then find all numbers
in the string with the pattern “\d+\.?\d+”, where \d+ matches one or more
digits, \.? matches an optional decimal point, and the final \d+ means the
match must end with a digit.
7. STEP 7: Creating a heatmap for the correlation plot.
8. STEP 8: The correlation plot shows that as X_res and Y_res increase,
the price of the laptop also increases, so X_res and Y_res are
positively correlated with price and carry much information. That is
the reason why we split the Resolution column into X_res and Y_res
columns.
9. STEP 9: To improve on this, we create a new column named
PPI (pixels per inch). The correlation plot shows that X_res and
Y_res are highly collinear, so we combine them with
Inches, which is less collinear, as
per the formula shown below
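The steps above can be sketched in plain Python. The report's own formula did not survive extraction, so the sketch assumes the standard pixels-per-inch definition: the diagonal resolution in pixels divided by the diagonal screen size in inches. The function names and sample strings are hypothetical, and the project itself works on data frame columns rather than single strings.

```python
import math
import re

def screen_flags(screen: str):
    """Steps 1 and 3: 1/0 flags for touch screen and IPS panel."""
    return int("Touchscreen" in screen), int("IPS" in screen)

def parse_resolution(screen: str):
    """Steps 5 and 6: split at the last 'x', then pull the digits out."""
    x_part, y_part = screen.rsplit("x", 1)
    # \d+\.?\d+ needs at least two digits, so the "4" in "4K" is skipped
    numbers = re.findall(r"\d+\.?\d+", x_part.replace(",", ""))
    return int(numbers[-1]), int(y_part)

def pixels_per_inch(x_res: int, y_res: int, inches: float) -> float:
    """Step 9 (assumed formula): sqrt(x_res^2 + y_res^2) / inches."""
    return math.hypot(x_res, y_res) / inches

# Hypothetical sample values
full_hd = "IPS Panel Full HD 1920x1080"
ultra_hd = "4K Ultra HD / Touchscreen 3840x2160"

x_res, y_res = parse_resolution(full_hd)
ppi = pixels_per_inch(x_res, y_res, 15.6)
```

Combining X_res, Y_res and Inches into a single PPI value lets the model use one informative feature in place of three correlated ones.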
CPU
1. STEP 1: Most common processors are made by Intel, so we
cluster the processors into categories such as i3, i5, i7, and other.
Here other means the Intel processors that do not have i3, i5 or i7
attached to them. They are completely different, which is why we
cluster them into other. The remaining category is AMD, which is a
different category altogether.
2. STEP 2: We extract the first 3 words of the CPU column, because
the first 3 words of every row under the CPU column give the type of the
CPU, which we are going to use.
3. STEP 3: If we get any of the Intel i3, i5 or i7 versions, we return
them as they are. If we get any other processor, we first check whether
it is a variant of Intel or not. If yes, we tag it as “Other Intel
Processor”, otherwise we tag it as “AMD Processor”
4. STEP 4: Price vs Processor variation
5. STEP 5: Dropping the CPU column
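Steps 2 and 3 amount to a small categorisation function. The version below is a sketch in plain Python. The function name and the sample CPU strings are hypothetical, and it assumes that Intel names appear in the form "Intel Core i5 ...".

```python
def cpu_category(cpu: str) -> str:
    """Map a raw CPU description to one of the coarse categories."""
    # Step 2: keep only the first three words of the description
    name = " ".join(cpu.split()[:3])
    # Step 3: return the common Intel families unchanged
    if name in ("Intel Core i3", "Intel Core i5", "Intel Core i7"):
        return name
    # Any other Intel chip goes into a single bucket
    if name.startswith("Intel"):
        return "Other Intel Processor"
    # Everything else is treated as AMD, as described in the text
    return "AMD Processor"

# Hypothetical sample values
samples = [
    "Intel Core i5 7200U 2.5GHz",
    "Intel Celeron Dual Core N3350 1.1GHz",
    "AMD A9-Series 9420 3GHz",
]
categories = [cpu_category(s) for s in samples]
```

Reducing the many distinct CPU strings to five categories keeps the number of one-hot encoded columns small for a dataset of this size.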
RAM
We will separate the type of memory and its value, just like
in the previous part. This part needs to be done in steps, because the memory
is not given as a single value: it comes
in different forms such as 128GB SSD + 1TB HDD, so to bring it into the
same form we need to do some modifications.
5. STEP 5: Replacing the TB word with “000”
6. STEP 6: Splitting the word across the “+” character
7. STEP 7: Stripping all the white spaces, basically eliminating white
space
8. STEP 8: Removing all the characters but keeping the numbers
9. STEP 9: Multiplying the elements and storing the result in subsequent
columns
10. STEP 10: Dropping unnecessary columns
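Steps 5 to 9 can be sketched on a single string as follows. The earlier steps are not shown in this section, so removing the "GB" suffix is an assumption, as is the restriction to whole-number sizes and to the HDD and SSD types. The function name is hypothetical.

```python
import re

def parse_memory(memory: str) -> dict:
    """Split a memory description into HDD and SSD sizes in GB."""
    # Assumed earlier step: drop the GB suffix. Step 5: TB becomes "000".
    text = memory.replace("GB", "").replace("TB", "000")
    # Steps 6 and 7: split across "+" and strip the white space
    parts = [part.strip() for part in text.split("+")]
    totals = {"HDD": 0, "SSD": 0}
    for part in parts:
        # Step 8: remove all the characters but keep the numbers
        digits = re.sub(r"\D", "", part)
        size = int(digits) if digits else 0
        # Step 9: multiply the size by a 1/0 flag for each memory type
        for kind in totals:
            totals[kind] += size * int(kind in part)
    return totals

combined = parse_memory("128GB SSD + 1TB HDD")
single = parse_memory("256GB SSD")
```

Each laptop then has one numeric column per memory type, which a regression model can use directly.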
GPU
Because we have limited data about the laptops, it is better to focus on
GPU brands rather than on the model values that appear beside
them.
1. STEP 1: Counting the values in GPU column
2. STEP 2: Extracting the brands
3. STEP 3: Removing the “ARM” row
4. STEP 4: Using the median to check whether outliers have any impact
So, after analysing everything, if we apply a logarithm to the price column, its
distribution becomes approximately Gaussian.
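The effect of the logarithm can be checked with a small sketch. The prices below are hypothetical and right-skewed, like typical price data, and skewness is computed as the third standardised moment. The code shows only that the transform reduces skew, not that the result is exactly Gaussian.

```python
import math

def skewness(values):
    """Third standardised moment (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical prices with one expensive outlier
prices = [20000, 25000, 30000, 35000, 40000, 60000, 150000]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

# A model trained on log prices predicts log prices, so its output
# has to be passed through exp() to get back to the price scale.
recovered = [math.exp(v) for v in log_prices]
```

The same inverse step applies at prediction time: the predicted value is exponentiated before it is shown as a price.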
LINEAR REGRESSION
RIDGE REGRESSION
LASSO REGRESSION
DECISION TREES
RANDOM FOREST
In all these algorithms, we apply OneHotEncoding to the categorical columns, selected
by their column indices. The remaining columns are kept as passthrough, so that no
column is affected except the ones undergoing the transformation.
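The idea can be illustrated without any library. The sketch below is plain Python, not the scikit-learn API: it one-hot encodes the columns at the given indices and passes the remaining columns through unchanged, placing the encoded columns first. The function name and sample rows are hypothetical.

```python
def one_hot_with_passthrough(rows, categorical_idx):
    """One-hot encode the chosen columns and pass the others through."""
    # Learn the categories of each selected column (sorted for a stable order)
    categories = {i: sorted({row[i] for row in rows}) for i in categorical_idx}
    encoded = []
    for row in rows:
        out = []
        # Encoded columns come first, one 0/1 column per category
        for i in categorical_idx:
            out.extend(1.0 if row[i] == c else 0.0 for c in categories[i])
        # Passthrough: every other column is copied unchanged
        out.extend(v for j, v in enumerate(row) if j not in categorical_idx)
        encoded.append(out)
    return encoded, categories

# Hypothetical rows: brand, RAM in GB, weight in kg
rows = [["Dell", 8, 1.8], ["HP", 16, 2.1], ["Dell", 4, 2.2]]
encoded, categories = one_hot_with_passthrough(rows, [0])
```

Only the brand column is expanded here, while RAM and weight keep their original values, which is the behaviour the passthrough setting is meant to guarantee.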
After finding that Random Forest has the best accuracy, we performed
hyperparameter tuning on it. If we do not specify the maximum depth, the
trees grow more complex and fit the training samples too closely. So, we used the
ccp_alpha hyperparameter, which stands for cost complexity pruning and provides an
option to control the size of a tree. Greater values of ccp_alpha increase the number
of nodes pruned.
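For reference, minimal cost-complexity pruning is usually defined through the measure below. The notation is introduced here for illustration: R(T) is the total error (impurity) of the leaves of tree T, |\widetilde{T}| is the number of leaves, and \alpha \ge 0 is the value passed as ccp_alpha.

```latex
R_{\alpha}(T) = R(T) + \alpha \, \lvert \widetilde{T} \rvert
```

Pruning keeps the subtree that minimises R_{\alpha}(T), so each extra leaf has to reduce the error by more than \alpha to be worth keeping.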
5.5 WEB APP DEVELOPMENT
First, we download and install the Streamlit library. After that, we unpickle
the files we pickled earlier and use the unpickled data in the app. We then
create input boxes for each laptop feature so that users can enter their own
configurations.
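The pickle round trip can be sketched with the standard library alone. The dictionary below is a hypothetical stand-in for the trained model, and the file name and pricing rule are made up for illustration. In the real app the restored object would be the fitted pipeline, fed with the values entered in the input boxes.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model
model = {"base_price": 30000.0, "price_per_gb_ram": 2500.0}

def predict_price(model, ram_gb):
    """Toy prediction rule used only to show that the restored object works."""
    return model["base_price"] + model["price_per_gb_ram"] * ram_gb

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    # Notebook side: pickle the object after training
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # App side: unpickle it when the web app starts
    with open(path, "rb") as f:
        restored = pickle.load(f)

price = predict_price(restored, ram_gb=8)
```

Only pickle files from a trusted source should be loaded, because unpickling can execute arbitrary code.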
5.6 DEPLOYMENT
After the web app code was ready, we pushed it to the Git repository. We
deployed our web application on the Heroku platform and maintain it through Git.
CHAPTER 6 – IMPLEMENTATION
1. Importing Basic Dependencies
3. Cleaning the data
5. Exploratory Data Analysis
Extracting X resolution and Y resolution for Screen Resolution feature
CPU analysis
RAM analysis
GPU analysis
Operating System Analysis
Before applying logarithm to the Price feature
6. Model Building
Importing Basic dependencies and splitting the dataset
Linear Regression
Ridge Regression
Lasso Regression
Decision Trees
Random Forest
Hyperparameter Tuning
Prediction with version 1
Pickling on our predicted data
8. GIT REPOSITORY
Deployment
CHAPTER 7 – RESULT
So, after the deployment of our web application, we can go to the website and enter our
custom configurations to know the predicted price.
The output of our model gives a smoother prediction line than the existing ones.
CHAPTER 8 – CONCLUSION
Price prediction through machine learning using the Random
Forest algorithm makes it easier for students to choose the
laptop specifications that best meet their needs and fit
their purchasing power. Students no longer need to consult
various sources to find suitable laptop specifications,
because the machine learning application presents the
desired specifications together with
their predicted prices.
REFERENCES
[1]. Sorower MS. A literature survey on algorithms for multi-label learning. Oregon
State University, Corvallis. 2010 Dec;18.
[2]. Pandey M, Sharma VK. A decision tree algorithm pertaining to the student
performance analysis and prediction. International Journal of Computer Applications.
2013 Jan 1;61(13).
[3]. Priyama A, Abhijeeta RG, Ratheeb A, Srivastavab S. Comparative analysis of
decision tree classification algorithms. International Journal of Current Engineering
and Technology. 2013 Jun;3(2):334-7
[4]. Streamlit.io, Kaggle.com, Wikipedia.com
[5]. Ho, T. K. (1995, August). Random decision forests. In Document analysis and
recognition, 1995., proceedings of the third international conference on (Vol. 1, pp.
278-282).
[6]. Weka 3 - Data Mining with Open Source Machine Learning Software in Java.
(n.d.), Retrieved from: https://www.cs.waikato.ac.nz/ml/weka/. [August 04, 2018].
[7]. Noor, K., & Jan, S. (2017). Vehicle Price Prediction System using Machine
Learning Techniques. International Journal of Computer Applications, 167(9), 27-31.
[8]. Pudaruth, S. (2014). Predicting the price of used cars using machine learning
techniques. Int. J. Inf. Comput. Technol, 4(7), 753-764.
[9]. Listiani, M. (2009). Support vector regression analysis for price prediction in a
car leasing application (Doctoral dissertation, Master thesis, TU Hamburg-Harburg).
[10]. Agencija za statistiku BiH. (n.d.), Retrieved from: http://www.bhas.ba
[accessed July 18, 2018].
[11]. Utku A, Hacer (Uke) Karacan, Yildiz O, Akcayol MA. Implementation of a New
Recommendation System Based on Decision Tree Using Implicit Relevance
Feedback. JSW. 2015 Dec.