PN1 Shakti Akshaya S PDF

House Price Prediction
BUSINESS REPORT
CAPSTONE PROJECT
HOUSE PRICE PREDICTION
Great Learning – PGP- Data Science and Business Analytics
- Shakti Akshaya S
Table of contents:
Problem Statement
Introduction of Business Problem
Data Report
Column Information
Data Types
Duplicates and Missing Values Check
Summary of Data
Exploratory Data Analysis
Univariate Analysis
Bivariate Analysis
Data Processing
Outlier Treatment
Business Insights
Problem statement:
A house's value is more than simply its location and square footage. Like the features that make up a
person, an informed party would want to know every aspect that gives a house its value. For
example, suppose you want to sell a house but do not know what price to expect: it can't
be too low or too high. To find a house's price, you usually look for similar properties in your
neighborhood and, based on the gathered data, try to assess your own house's price.
1) Introduction of the business problem
a) Defining problem statement

When a person or business wants to sell or buy a house, they face the same issue: they don't know
what price to offer, so they may offer too little or too much for the property. We can therefore
analyze the available data on properties in the area and predict the price. We need to find how
these attributes influence house prices. Right pricing is a very important aspect of selling a house, so it is
important to understand which factors influence the price and how. The objective is to predict the
right price of a house based on its attributes.
The price of a house depends on various factors such as size or area, number of bedrooms, location, the price of
other houses, and many others. Real estate investors would like to know the actual value of a house in
order to buy and sell properties. They lose money when they pay more than the current market
price or sell for less than it. Banks also want to know the
current market price of a house when they accept it as collateral for a loan; sometimes a loan
applicant overvalues their house to borrow the maximum amount from the bank. Local home buyers can predict the
price of a house to find out whether a seller is asking too much, and local sellers can predict their house price to
find out what a fair market price is.
b) Need of the study/project

To analyse and predict the price of a house using the feature variables given in the dataset.
The purpose of the project is to gain a better understanding of the algorithms learnt in the academic lectures by
applying them to a specific problem and checking whether the results match expectations. The main
goal of the project is to predict the price of residential homes.
c) Understanding business/social opportunity

As people don't know the features/aspects that make up a property's price, we can provide them with house
buying and selling guidance services in the area, so that they can buy or sell their property at the most suitable
price and do not lose their hard-earned money by offering too low a price or keep waiting for
buyers after asking too high a price.
Q 2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required).
Answer:
Dimensions of Data:
Number of records and features/aspects in the data: (21613, 23)
We have 21613 rows and 23 columns.
Let's check the columns/features we have in the dataset:
From the above we can see the different columns we have in the dataset.
These columns provide the following information:
1. cid: Notation for a house. It is of no use to us, so we will drop this column.
2. dayhours: Represents the date when the house was sold.
3. price: Our TARGET feature, which we have to predict based on the other features.
4. room_bed: Represents the number of bedrooms in a house.
5. room_bath: Represents the number of bathrooms.
6. living_measure: Represents the square footage of the house.
7. lot_measure: Represents the square footage of the lot.
8. ceil: Represents the number of floors in the house.
9. coast: Represents whether the house has a waterfront view. It seems to be a categorical
variable. We will confirm this in our further data analysis.
10. sight: Represents how many times the house has been viewed.
11. condition: Represents the overall condition of the house. It is a kind of rating given to the
house.
12. quality: Represents the grade given to the house based on a grading system.
13. ceil_measure: Represents the square footage of the house apart from the basement.
14. basement: Represents the square footage of the basement.
15. yr_built: Represents the year when the house was built.
16. yr_renovated: Represents the year when the house was last renovated.
17. zipcode: Represents the zipcode, as the name implies.
18. lat: Represents the latitude coordinate.
19. long: Represents the longitude coordinate.
20. living_measure15: Represents the square footage of the house as measured in 2015,
since the house area may or may not have changed after a renovation.
21. lot_measure15: Represents the square footage of the lot as measured in 2015, since the
lot area may or may not have changed after a renovation.
22. furnished: Tells whether the house is furnished or not. It seems to be a categorical variable, as
the description implies.
23. total_area: Represents the total area, i.e. the area of both living and lot.
Let's see the data types of the features:
Now we will check for duplicate values. We see that we don't have any duplicate values.
No. of duplicates : 0
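The shape and duplicate checks above can be sketched as follows. This is a minimal illustration in plain Python on made-up rows, not the notebook's actual code; the helper names `shape` and `count_duplicates` are hypothetical. In a pandas notebook, `df.shape` and `df.duplicated().sum()` would typically give the same information.

```python
# Toy records standing in for the real dataset (values are made up).
rows = [
    {"cid": "A1", "price": 300000, "room_bed": 3},
    {"cid": "B2", "price": 450000, "room_bed": 4},
    {"cid": "A1", "price": 300000, "room_bed": 3},  # exact copy of the first row
]


def shape(records):
    """Return (number of rows, number of columns)."""
    n_cols = len(records[0]) if records else 0
    return (len(records), n_cols)


def count_duplicates(records):
    """Count rows that repeat an earlier row exactly."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


print("shape:", shape(rows))
print("No. of duplicates:", count_duplicates(rows))
```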
Now we will check if we have missing values:
We have 689 missing values.
We will sort the counts so that we can easily see which variables have missing values.
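A minimal sketch of counting and sorting missing values per column is shown here. It assumes missing entries are represented as `None`; the toy rows and the helper name `missing_counts` are made up for illustration.

```python
# Toy rows; None marks a missing value (an assumption for this sketch).
rows = [
    {"room_bed": 3, "ceil": None, "sight": 0},
    {"room_bed": None, "ceil": None, "sight": 2},
    {"room_bed": 4, "ceil": 1.5, "sight": None},
]


def missing_counts(records):
    """Return (column, number of missing values) pairs, largest count first."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


for column, n_missing in missing_counts(rows):
    print(column, n_missing)
```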
living_measure15, room_bed, room_bath, sight, lot_measure, ceil, lot_measure15, total_area,
living_measure, ceil_measure, basement, yr_built: these are all numerical variables, so we will
impute their missing values with the mode.
While studying the data we saw some special characters such as $, so we
will replace them with null values and treat the null values later.
After replacing the characters we have a total of 864 null values.
Now we have a mix of categorical and numerical variables with null values; we will replace them
with the mode.
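The two cleaning steps described above (turning special characters into nulls, then filling nulls with the mode) can be sketched as below. This is an illustration under assumptions: the column values are made up, `None` stands in for a null, and `clean_and_impute` is a hypothetical helper name.

```python
from statistics import mode

# Hypothetical column with a missing entry (None) and a stray '$' marker.
room_bed = [3, 4, "$", 3, None, 2, 3]


def clean_and_impute(values, bad_markers=("$",)):
    """Turn marker characters into None, then fill every None with the mode."""
    cleaned = [None if v in bad_markers else v for v in values]
    observed = [v for v in cleaned if v is not None]
    fill_value = mode(observed)  # most frequent observed value
    return [fill_value if v is None else v for v in cleaned]


print(clean_and_impute(room_bed))
```

The mode is computed only from the observed values, so the placeholder characters never influence the fill value.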
When we manually checked the data, we saw a record where
room_bed is 33.
Analysing it, we understand that this is a data entry mistake.
So we will treat it after the univariate and bivariate analysis.
Now we will do the 5-factor analysis of the features.
1. cid: House ID/Property ID. Not used for analysis.
2. dayhours: 5-factor analysis is reflecting for this column.
3. price: Our target column's values are in the 75k - 7700k range. As Mean > Median, it's Right-
Skewed.
4. room_bed: Number of bedrooms ranges from 0 - 11. As Mean is slightly > Median,
it's slightly Right-Skewed.
5. room_bath: Number of bathrooms ranges from 0 - 8. As Mean is slightly < Median,
it's slightly Left-Skewed.
6. living_measure: Square footage of the house ranges from 290 - 13,540. As Mean > Median,
it's Right-Skewed.
7. lot_measure: Square footage of the lot ranges from 520 - 16,51,359. As Mean is almost double
the Median, it's Highly Right-Skewed.
8. ceil: Number of floors ranges from 1 - 3.5. As Mean ~ Median, it's almost Normally
Distributed.
9. coast: This value represents whether the house has a waterfront view or not. It's a categorical
column. From the above analysis we learned that very few houses have a waterfront view.
10. sight: Value ranges from 0 - 4. As Mean > Median, it's Right-Skewed.
11. condition: Represents the rating of the house, which ranges from 1 - 5. As Mean > Median,
it's Right-Skewed.
12. quality: Represents the grade given to the house, which ranges from 1 - 13. As Mean > Median,
it's Right-Skewed.
13. ceil_measure: Square footage of the house apart from the basement ranges from 290 - 9,410. As
Mean > Median, it's Right-Skewed.
14. basement: Square footage of the house basement ranges from 0 - 4,820. As Mean is much >
Median, it's Highly Right-Skewed.
15. yr_built: House built year ranges from 1900 - 2015. As Mean < Median, it's Left-
Skewed.
16. yr_renovated: House renovation year only 2015. So this column can be used
as a Categorical Variable for knowing whether the house is renovated or not.
17. zipcode: House zipcode ranges from 98001 - 98199. As Mean > Median, it's Right-
Skewed.
18. lat: Latitude ranges from 47.1559 - 47.7776. As Mean < Median, it's Left-Skewed.
19. long: Longitude ranges from -122.5190 to -121.315. As Mean > Median, it's Right-
Skewed.
20. living_measure15: Value ranges from 399 to 6,210. As Mean > Median, it's Right-
Skewed.
21. lot_measure15: Value ranges from 651 to 8,71,200. As Mean is much > Median,
it's Highly Right-Skewed.
22. furnished: Represents whether the house is furnished or not. It's a Categorical Variable.
23. total_area: Total area of the house ranges from 1,423 to 16,52,659. As Mean is almost
double the Median, it's Highly Right-Skewed.
From the above analysis we learned that:
The distribution of most columns is Right-Skewed and only a few features are Left-Skewed (like
room_bath, yr_built, lat).
The columns that are Categorical in nature are: coast, yr_renovated, furnished
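The summary above relies on a five-number summary and on the rule of thumb that Mean > Median suggests right skew. A minimal sketch of both, on made-up prices (in thousands), is given below; the function names are hypothetical and the mean-versus-median comparison is only a heuristic, not a formal skewness test.

```python
from statistics import mean, median, quantiles


def five_number_summary(values):
    """Return min, first quartile, median, third quartile and max."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


def skew_direction(values):
    """Heuristic used in the summary: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"
    if m < med:
        return "left-skewed"
    return "roughly symmetric"


prices = [75, 320, 450, 540, 645, 7700]  # made-up prices in thousands
print(five_number_summary(prices))
print(skew_direction(prices))
```

One very large value pulls the mean well above the median, which is the pattern reported for price, lot_measure and total_area.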
Q 3. Exploratory Data Analysis

a) Univariate analysis (distribution and spread for every continuous attribute, distribution of
data in categories for categorical ones)
b) Bivariate analysis (relationship between different variables, correlations)
a) Removal of unwanted variables (if applicable)
b) Missing Value treatment (if applicable)
d) Outlier treatment (if required)
e) Variable transformation (if applicable)
f) Addition of new variables (if required)
Exploratory Data Analysis
Let's do some visual data analysis of the features
Univariate Analysis - By BoxPlot
We can see that a lot of features have outliers, so we might need to treat them before
building the model.
Analyzing Feature: cid
cid - some cid values appear multiple times; it seems the data contains houses that were sold more than once.
We have 176 properties that were sold more than once in the given data.
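Counting repeat sales amounts to counting how often each cid occurs. A minimal sketch with made-up identifiers (the helper name `repeat_sales` is hypothetical):

```python
from collections import Counter

# Made-up property identifiers; a repeated value means a repeated sale.
cids = ["A1", "B2", "A1", "C3", "B2", "A1", "D4"]


def repeat_sales(ids):
    """Return {cid: number of sales} for properties sold more than once."""
    counts = Counter(ids)
    return {cid: n for cid, n in counts.items() if n > 1}


repeats = repeat_sales(cids)
print("Properties sold more than once:", len(repeats))
print(repeats)
```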
Analyzing Feature: dayhours

We created a new data frame that can be used for modeling.
We converted the dayhours feature to 'month_year', as the sale month and year are relevant for analysis.
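The conversion can be sketched as below. The report does not show the raw format of dayhours, so this sketch assumes timestamps that look like '20141013T000000'; both that format and the helper name `to_month_year` are assumptions. The output uses a year-month form so that the labels sort chronologically.

```python
from datetime import datetime


def to_month_year(dayhours, fmt="%Y%m%dT%H%M%S"):
    """Parse a sale timestamp and keep only the year and month.

    The default format is an assumption about how dayhours is stored.
    """
    return datetime.strptime(dayhours, fmt).strftime("%Y-%m")


sales = ["20140502T000000", "20141013T000000", "20150527T000000"]  # made-up values
print([to_month_year(s) for s in sales])
```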
We can see that most houses were sold in April and July.
The timeline of the sale data runs from May-2014 to May-2015, and April
has the highest mean price.
Analyzing Feature: Price (our Target)
The price ranges from 75,000 to 77,00,000 and the distribution is right-skewed.
Analyzing Feature: room_bed
The value of 33 seems to be an outlier; we need to check the data point before treating
it.
We will delete this data point after the bivariate analysis, as it looks like an outlier: the price is low for a
33-bedroom property.
Most of the houses/properties have 3 or 4 bedrooms.
Analyzing Feature: room_bath
The majority of the properties have between 1.0 and 2.5 bathrooms.
Skewness is: 0.5051364190311984
Analyzing Feature: Living measure
The data distribution tells us that living_measure is right-skewed.
Plotting the boxplot for living_measure
There are many outliers in living_measure. We need to review them further before treating them.
We will check the number of data points with living_measure greater than 8000.
We have only 9 properties/houses with a living_measure of more than 8k, so we will treat these
outliers.
Analyzing Feature: lot_measure
The data is skewed, as is visible from the plot.
We checked the number of data points with lot_measure greater than 1250000.
We have only 1 property with a lot_measure of more than 12,50,000, so we need to treat it.
Analyzing Feature: ceil

Let's see the ceil count for all the records.
We can see that most houses have 1 floor.
The above graph confirms that most properties have 1 or 2 floors.
Analyzing Feature: coast

coast - most houses do not have a waterfront view; very few are waterfront
Analyzing Feature: sight

sight - most houses have not been viewed
Analyzing Feature: condition

condition - overall, most houses are rated 3 or above for their condition
Analyzing Feature: quality

quality - most properties have a quality rating between 6 and 10
Checking the number of data points with a quality rating of 13:
There are only 13 properties which have the highest quality rating.
Analyzing Feature: ceil_measure

ceil_measure - it is highly skewed
There is no pattern in ceil vs ceil_measure.
The vertical lines at each point represent the interquartile range of values at that point.
Analyzing Feature: basement
We can see 2 gaussians, which tells us that some properties don't have basements and
some do.
Almost 60% of the properties have no basement:
these houses have a basement measure of zero.
So we have plotted the boxplot only for properties which have basements.
We can clearly see that there are outliers. We need to treat them before building our model.
Now we will check the number of data points with 'basement' greater than 4000.
We have only 2 properties with a basement measure of more than 4,000.
Distribution of houses having a basement
The distribution for houses having a basement is right-skewed.
Analyzing Feature: yr_built
Houses range from new to very old.
The built year of the properties ranges from 1900 to 2014, and we can see an upward trend with time.
Analyzing Feature: yr_renovated
Only 914 houses out of 21613 were renovated.
yr_renovated - plot of houses which are renovated
Analyzing Feature: furnished
Most properties are not furnished. The furnished column needs to be converted into a categorical column.
BIVARIATE ANALYSIS
PairPlot: We have plotted all the variables and confirmed our above
deductions with more confidence.
From the above pair plot, we observed/deduced the following:
1. price: The price distribution is Right-Skewed, as we deduced earlier from our 5-factor analysis.
2. room_bed: The plot of our target variable (price) against room_bed is not linear. Its distribution
has a lot of gaussians.
3. room_bath: Its plot against price shows a somewhat linear relationship. The distribution has a
number of gaussians.
4. living_measure: The plot against price shows a strong linear relationship. It also has a linear
relationship with the room_bath variable, so we might remove one of these two. The distribution is
Right-Skewed.
5. lot_measure: No clear relationship with price.
6. ceil: No clear relationship with price. It has only 6 unique values.
Therefore, we can convert this column into a categorical column.
7. coast: No clear relationship with price. Clearly it's a categorical variable with 2 unique
values.
8. sight: No clear relationship with price. This has 5 unique values. Can be converted
to a Categorical Variable.
9. condition: No clear relationship with price. This has 5 unique values. Can
be converted to a Categorical Variable.
10. quality: Somewhat linear relationship with price. Has discrete values from 1 - 13.
Can be converted to a Categorical Variable.
11. ceil_measure: Strong linear relationship with price, and also with the room_bath and
living_measure features. The distribution is Right-Skewed.
12. basement: No clear relationship with price.
13. yr_built: No clear relationship with price.
14. yr_renovated: No clear relationship with price. Has 2 unique values. Can be
converted to a Categorical Variable which tells whether the house is renovated or not.
15. zipcode, lat, long: No clear relationship with price or any other feature.
16. living_measure15: Somewhat linear relationship with the target feature. It carries the same information as
living_measure. Therefore we can drop this variable.
17. lot_measure15: No clear relationship with price or any other feature.
18. furnished: No clear relationship with price or any other feature. It has 2 unique values, so it
can be converted to a Categorical Variable.
19. total_area: No clear relationship with price, but it has a very strong linear
relationship with lot_measure, so one of them can be dropped.
In brief, the features below should be converted to Categorical Variables:
ceil, coast, sight, condition, quality, yr_renovated, furnished
And the columns below can be dropped after checking the Pearson correlation:
zipcode, lat, long, living_measure15, lot_measure15, total_area
Now we will check the correlation between the different features.
Table in Jupyter notebook.

From the above matrix, we found linear relationships between the following features:
1. price: room_bath, living_measure, quality, living_measure15, furnished
2. living_measure: price, room_bath. So we can consider dropping the 'room_bath' variable.
3. quality: price, room_bath, living_measure
4. ceil_measure: price, room_bath, living_measure, quality
5. living_measure15: price, living_measure, quality. So we can consider dropping
living_measure15 as well, as it gives the same information as living_measure.
6. lot_measure15: lot_measure. Therefore, we can consider dropping lot_measure15, as
it gives the same information.
7. furnished: quality
8. total_area: lot_measure, lot_measure15. Therefore, we can consider dropping the total_area
feature as well, as it gives the same information as lot_measure.
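The idea of flagging feature pairs that carry nearly the same information can be sketched with a hand-written Pearson correlation. The columns below are made up, and the 0.9 threshold and the helper names `pearson` and `correlated_pairs` are assumptions for illustration; the report does not state which cut-off was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.9):
    """List column pairs whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Made-up columns: total_area is lot_measure plus a smaller living area.
data = {
    "lot_measure": [5000, 7200, 4000, 10000, 6500],
    "total_area": [6800, 9700, 5200, 13100, 8300],
    "room_bed": [3, 4, 2, 3, 4],
}
print(correlated_pairs(data))
```

Only one column of each flagged pair needs to be kept, which is the reasoning behind dropping total_area in favour of lot_measure.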
We have plotted a heatmap and can easily confirm our above findings.
Analyzing Bivariate for Feature: month_year

month_year - the month and year in which the house was sold. Price is not influenced by it, though there are outliers, which can be
easily seen.
The mean price of the houses tends to be higher during March, April and May compared to the
September, October, November and December period.
Analyzing Bivariate for Feature: room_bed
room_bed - outliers can be seen easily. The mean and median of price increase with the number of
bedrooms per house up to a point and then drop.
There is a clear increasing trend in price with room_bed.
Analyzing Bivariate for Feature: room_bath
room_bath - outliers can be seen easily. Overall, the mean and median price increase with increasing
room_bath.
There is an upward trend in price with an increase in room_bath.
Analyzing Bivariate for Feature: living_measure
living_measure - price increases with increase in living measure
There is a clear increase in the price of the property with an increase in the living measure, but there
seems to be one outlier to this trend. We need to evaluate it.
Analyzing Bivariate for Feature: lot_measure

lot_measure - there seems to be no relation between lot_measure and price
lot_measure - the range of values is very large, so we break it into segments to get a better view.
There doesn't seem to be any relation between lot_measure and the price trend.
Almost 95% of the houses have lot_measure < 25000, but there is no clear trend between
lot_measure and price.
lot_measure > 100000 - price increases with increase in living measure
Analyzing Bivariate for Feature: ceil

ceil - median price increases initially and then falls
There is a slight upward trend in price with ceil.
Analyzing Bivariate for Feature: coast

coast - the mean and median price of waterfront houses is high; however, such houses are very few compared
to non-waterfront houses.
Also, the mean and median living_measure is greater for waterfront houses.
Properties with a waterfront tend to have a higher price than non-waterfront
properties.
Analyzing Bivariate for Feature: sight
sight - has outliers. Houses that are viewed more have a higher price (mean and median) and a larger
living area as well.
Properties with a higher price have more sights than houses with a lower price.
Sight - viewed in relation to price and living_measure
Costlier houses with a large living area are viewed more.
The above graph also confirms that properties with a higher price have more sights than
houses with a lower price.
Analyzing Bivariate for Feature: condition

condition - as the condition rating increases, the mean and median of price and living measure also
increase.
The price of the house increases with the condition rating of the house.
Condition - viewed in relation to price and living_measure. Most houses are rated 3 or more.
We can see some outliers as well.
So we found that smaller houses are in better condition, and houses in better condition
have higher prices.
Analyzing Bivariate for Feature: quality

quality - as the grade increases, price and living_measure increase (mean and median)
There is a clear increase in the price of the house with a higher quality rating.
quality - viewed in relation to price and living_measure. Most houses are graded 6 or more. We
can see some outliers as well.
Analyzing Bivariate for Feature: ceil_measure

ceil_measure - price increases with increase in ceil measure
There is an upward trend in price with ceil_measure.
Analyzing Bivariate for Feature: basement
We will create a categorical variable 'has_basement' to separate houses with a basement
from those without. This categorical variable will be used for further analysis.
Binning basement to analyse data

basement - after binning, the data shows that houses with a basement are costlier and have a higher
living measure (mean & median).
Houses with a basement have a higher price than houses without a basement.
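The binning step can be sketched as follows: a house counts as having a basement when its basement measure is above zero, and the price is then summarised per group. The records are made up and `price_by_basement` is a hypothetical helper, not the notebook's code.

```python
from statistics import mean, median

# Made-up records; prices in thousands, basement in square feet.
houses = [
    {"price": 300, "basement": 0},
    {"price": 500, "basement": 800},
    {"price": 350, "basement": 0},
    {"price": 700, "basement": 1200},
]


def price_by_basement(records):
    """Group prices by a Yes/No has_basement flag and summarise each group."""
    groups = {"No": [], "Yes": []}
    for record in records:
        has_basement = "Yes" if record["basement"] > 0 else "No"
        groups[has_basement].append(record["price"])
    return {
        flag: {"mean": mean(prices), "median": median(prices)}
        for flag, prices in groups.items()
        if prices  # skip a group that has no houses
    }


print(price_by_basement(houses))
```

The same pattern would apply to the 'has_renovated' flag by testing yr_renovated instead of basement.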
Analyzing Bivariate for Feature: yr_built
yr_built - outliers can be seen easily.
We will create a new variable, HouseLandRatio: the proportion of living area in the total area of
the house. We will explore the trend of price against this HouseLandRatio.
HouseLandRatio - computed as the ratio living_measure/total_area
Signifies - the share of land used for construction of the house
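A minimal sketch of the HouseLandRatio computation, using made-up measurements. The function name `house_land_ratio` and the guard against a non-positive total_area are assumptions added for illustration.

```python
def house_land_ratio(living_measure, total_area):
    """Share of the total area that is living area (living_measure / total_area)."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    return living_measure / total_area


# Made-up house: 2,000 sq ft of living area on a 6,000 sq ft lot,
# so total_area (living plus lot) is 8,000 sq ft.
living, lot = 2000, 6000
ratio = house_land_ratio(living, living + lot)
print(f"HouseLandRatio: {ratio:.2f}")
```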
Analyzing Bivariate for Feature: yr_renovated
Most houses were renovated after the 1980s. We will create a new categorical variable
'has_renovated' to categorize each property as renovated or non-renovated. For further
analysis we will use this categorical variable.
Let's try to group yr_renovated.
has_renovated - renovated houses have a higher mean and median price; however, this does not confirm whether
the prices of the renovated houses actually increased or not.
HouseLandRatio - renovated houses utilized more land area for the construction of the house.
Renovated properties have a higher price than others with the same living measure space.
Analyzing Bivariate for Feature: furnished

furnished - furnished houses have a higher price and a greater living_measure
Furnished houses have a higher price than non-furnished houses.
DATA PROCESSING
Treating Outliers
We have seen outliers in the columns room_bed (33 bedrooms),
living_measure, lot_measure, ceil_measure and basement.
Treating outliers for column - ceil_measure
After treating the outliers of ceil_measure, the data was reduced by about 600 (~3%) data points, but
the data looks good.
Treating outliers for column - basement
We got 408 records as outliers; let's drop these outliers.
After treating the outliers of basement, about 400 (~2%) data
points were removed. In total, about 5% of the data has been removed after
treating ceil_measure and basement.
Let's see the boxplot now for basement.
Treating outliers for column - living_measure
We got 178 records as outliers. Let's treat them by dropping them.
Let's see the boxplot after dropping the outliers.
By treating the outliers of living_measure, we lost 178 more data points,
and the data distribution looks normal.
Shape of the data after treating outliers in the living_measure column: (20416, 27)
Treating outliers for column - lot_measure
We got some records which are outliers. Let's drop these outlier records.
Let's plot after treating the outliers.
The total number of outliers in lot_measure is 2124 data points, but we are still going ahead with treating
them. We will analyze later whether there is any impact on the data set.
Treating outliers for column - room_bed
As we know from our earlier findings, room_bed = 33 was an outlier; let's look at the record and drop it
from the dataset.
In summary, after treating outliers, we have lost about 15% of the data. We will analyse the
impact of this data loss during the model evaluation.
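The report does not state which rule was used to identify outliers. The sketch below assumes the common 1.5 x IQR rule, which is an assumption rather than the method confirmed by the report, and drops records that fall outside the bounds. The records and the helper names `iqr_bounds` and `drop_outliers` are made up.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(records, column, k=1.5):
    """Keep only the records whose value in `column` lies within the fences."""
    low, high = iqr_bounds([record[column] for record in records], k)
    return [record for record in records if low <= record[column] <= high]


# Made-up bedroom counts, including the suspicious value 33.
houses = [{"room_bed": beds} for beds in [2, 3, 3, 3, 4, 4, 5, 33]]
kept = drop_outliers(houses, "room_bed")
print(len(houses) - len(kept), "record(s) dropped")
```

Applying such a filter column after column shrinks the data each time, which is consistent with the cumulative loss reported above; comparing row counts before and after each step makes that loss explicit.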
4. Business insights from EDA

a) Is the data unbalanced? If so, what can be done? Please explain in the context of the
business
b) Any business insights using clustering (if applicable)
c) Any other business insights
Answer:
The data provided was a mix of numerical and categorical variables, so it was necessary to take that into
consideration.
We observed the presence of outliers, which indicated that the data was unbalanced, but we have
treated the outliers as well as the missing values.
Some of the errors appear to be data entry errors.
Sometimes a business could be at a loss due to such mistakes.
We can use a pincode data set to get an idea of the cities and the county each property belongs to.
