

Predicting Housing Prices Using Machine Learning

Divyansh Shah, Subhanjan Das, Yuvraj Shand, Abhi Tyagi, Hardi Patel
Faculty of Business, Humber Institute of Technology and Learning
Toronto, Canada

Abstract: Housing is undoubtedly one of the most heavily invested asset classes of our generation, and hence it is critically important to keep enhancing the models and algorithms currently in use. In this study we examine the key variables that influence house prices and forecast these prices with two Machine Learning models, Multiple Linear Regression and k-Nearest Neighbours (k-NN). We also evaluate the performance of both models. Although the difference is not significant, the multiple linear regression model consistently outperformed the k-NN algorithm. Though only applicable for simple predictions, the models described in our work yielded exceptionally low Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

Keywords: Multiple Linear Regression, k-Nearest Neighbours, Machine Learning.

I. INTRODUCTION
The purchase of a property is easily the most significant financial investment for the majority of the population. The purchase typically follows an extensive decision-making process that involves enormous amounts of research[1]. For the majority of home buyers, this research proves to be a challenge due to a lack of information and knowledge regarding the housing market and the overall state of the economy[1]. Properties and assets are generally overvalued because of this lack of knowledge and a multitude of other factors, which induces an imbalance in property prices and the housing market[1]. Predicting the prices of these properties is hence an extremely challenging research avenue, as the housing market is influenced by several mutually correlated factors[1]. Human behavior also plays an integral role in determining the price of a property in an area or a locality[1]. Price prediction models and algorithms are of critical significance because extremely high-value transactions depend on them, owing to the integration of Machine Learning and Artificial Intelligence with banks, asset management firms and big businesses[1]. It is therefore indispensable to conduct substantial studies that adopt newer methodologies and approaches rather than relying only on current methods[1].

In this paper, we attempt to construct realistic regression models and evaluate their performance and efficiency in order to accurately estimate the value of real estate[3]. We also identify several relevant factors that directly influence the price of a property, and to what extent.

II. DATA
The dataset consists of 21 columns and 21613 rows, for a total of 453,873 entries.

Table 1. Data Description

Variable        Description                                         Data Type
price           Sale price of the house                             Numeric
bedrooms        No. of bedrooms in the house                        Ordinal
bathrooms       No. of bathrooms in the house                       Ordinal
sqft_living     Size of the living room                             Numeric
sqft_lot        Size of the entire lot                              Numeric
floors          Type of flooring in the house                       Ordinal
waterfront      Property facing water body                          Categorical
view            Rating of the view from the house                   Ordinal
condition       The condition of the house at the time of viewing   Ordinal
grade           Rating of building construction and design          Ordinal
sqft_above      Aside from the basement, square feet above ground   Numeric
sqft_basement   Square feet underground                             Numeric
yr_built        How old the house is                                Numeric
yr_renovated    When the house was renovated                        Numeric
sqft_living15   Average size of living area                         Numeric
sqft_lot15      Area of land lots                                   Numeric
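
The dimensions and column counts reported for the dataset can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the figures stated in this section; the variable names are illustrative and no data file is read.

```python
# Dimensions quoted in Section II.
n_rows = 21613
n_cols = 21

# Every row holds one value per column.
total_entries = n_rows * n_cols
print(total_entries)  # 453873

# Column-type counts quoted in the text: integer, float and object columns.
type_counts = {"int": 15, "float": 5, "object": 1}
assert sum(type_counts.values()) == n_cols

# The five identifier/date/location columns the paper drops before modelling.
dropped = ["id", "date", "zipcode", "lat", "long"]
remaining = n_cols - len(dropped)
print(remaining)  # 16, i.e. 'price' plus 15 candidate predictors
```

The 16 remaining columns match the 16 variables listed in the data description table.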

There are 15 integer-type columns, 5 float-type columns and 1 object column. The dataset contains no NULL values.

The ID, date, zip code, lat and long columns were removed, as they are not useful for the modeling and would have consumed additional memory.

The following table gives the minimum, mean, standard deviation, median and maximum of the continuous variables in the housing dataset:

          price       sqft_living   sqft_lot    sqft_above   sqft_basement   sqft_living15   sqft_lot15
Min       75,000      290           520         290          -               399             651
Mean      540,088     2,080         15,107      1,788        292             1,987           12,768
Std       367,127     918           41,421      828          443             685             27,304
Median    450,000     1,910         7,618       1,560        -               1,840           7,620
Max       7,700,000   13,540        1,651,359   9,410        4,820           6,210           871,200

Figure 1: Description of numerical columns

III. WORKING
This work aims to build a linear regression model and a k-Nearest Neighbours model to predict housing prices and to determine which of the two gives better predictions. The dataset consists of 21 columns and 21613 records. From a brief overview of the dataset, we assessed that 5 columns, i.e., 'id', 'date', 'zipcode', 'lat', and 'long', do not contribute much to the price of a house in the current dataset.

The next step was to remove or correct any outliers in the dataset. There was only one outlier, in the 'bedrooms' column. The house with id 2402100895 had a typographical error, so the value was corrected rather than the record being removed.

As no categorical column correlated well with the output variable 'price', no dummy variables were created. A few of the houses had been renovated, which could have influenced their price. Thus, a new variable was created to indicate whether a house had been renovated: it takes the value 1 if the house was renovated and 0 otherwise. The original column 'yr_renovated' was then deleted from the dataset.

Figure 2: Correlation Heat Map

Before fitting any regression, the dataset was put through a variable search using backward elimination, forward selection, and stepwise selection. For this step the dataset was split into training and validation sets, and a Python function was created to fit each candidate model and compute its AIC score[2]. This was done to identify which variables would be beneficial for the regression model. The backward elimination method suggested removing 'sqft_living' and 'floors'. However, the correlation of 'sqft_living' with 'price' is very high, so it was not removed. The forward selection method was applied next and suggested excluding 'sqft_basement' and 'floors' from further analysis. The stepwise selection method suggested the same result.

Following these suggestions, the linear regression model was fitted using 60% of the data for training and 40% for testing. The trained model was then used to predict the prices of the houses in the testing dataset.

Next, a k-Nearest Neighbours model was used to predict the prices of the houses, with k set to 5 and to 10.

IV. RESULT

Table 2. Results

                          Linear Regression   k-Nearest Neighbours   k-Nearest Neighbours
                                              (k = 5)                (k = 10)
Mean Absolute Error       151751.64           158357.06              153539.86
Mean Squared Error        58443560506.06      70345440921.10         68241853246.20
Root Mean Squared Error   241751.03           265227.15              261231.41

As shown in Table 2, the linear regression model has the lowest MAE, MSE and RMSE, at 151751.64, 58443560506.06 and 241751.03 respectively. A model is considered good if its MAE and RMSE are low. Therefore, this model is the most suitable of the three for predicting the prices of the houses.

Among the k-NN models, the k = 10 model had the lowest RMSE at 261231.41, indicating that the model performed better with 10 nearest neighbours.

ACKNOWLEDGMENT
We extend our sincere gratitude to Humber Institute of Technology and Learning for guiding us throughout the process of this work.

REFERENCES
[1] O'Farrell, S. (2018). House Price Prediction. Comparison of Data Mining Models to Predict House Prices.
[2] Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2020). Data Mining for Business Analytics: Concepts, Techniques and Applications in Python. John Wiley & Sons, Inc.
[3] Zhang, Q. (2021). Housing Price Prediction Based on Multiple Linear Regression. Scientific Programming, vol. 2021, Article ID 7678931, 9 pages.
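
The evaluation in Section IV rests on three error metrics and on k-NN averaging over the nearest training records. The paper does not show its code, so the following is only a minimal pure-Python sketch under stated assumptions: the helper names (`mae`, `mse`, `rmse`, `knn_predict`) and the toy data are hypothetical, Euclidean distance and unweighted averaging are assumed for k-NN, and the authors most likely used library implementations instead. The final lines check that each reported RMSE agrees with the square root of the reported MSE.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, in price units."""
    return math.sqrt(mse(y_true, y_pred))

def knn_predict(train_X, train_y, query, k):
    """Predict a price as the mean price of the k closest training rows.

    Assumes Euclidean distance and equal weights (not stated in the paper).
    """
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy data (hypothetical): living area in square feet -> sale price.
toy_X = [(1000,), (1500,), (2000,), (2500,), (3000,)]
toy_y = [200000, 300000, 400000, 500000, 600000]
print(knn_predict(toy_X, toy_y, (2100,), k=2))  # 450000.0

# Values reported in the results table: (MSE, RMSE) per model.
reported = {
    "linear regression": (58443560506.06, 241751.03),
    "k-NN (k = 5)": (70345440921.10, 265227.15),
    "k-NN (k = 10)": (68241853246.20, 261231.41),
}
for name, (reported_mse, reported_rmse) in reported.items():
    # Each RMSE should equal sqrt(MSE) up to rounding in the table.
    assert abs(math.sqrt(reported_mse) - reported_rmse) < 0.01, name
```

Because RMSE squares the errors before averaging, it penalises large misses more heavily than MAE, which is one reason the two metrics are reported side by side.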
