PN1 Shakti Akshaya S PDF
BUSINESS REPORT
CAPSTONE PROJECT
- Shakti Akshaya S
House Price Prediction
Problem statement:
A house's value is more than just its location and square footage. Like the features that make up a
person, an informed buyer or seller would want to know every aspect that gives a house its value. For
example, suppose you want to sell a house but do not know what price to expect: it should be
neither too low nor too high. To estimate it, you would usually look for similar properties in your
neighborhood and assess your house's price from the data you gather.
The price of a house depends on various factors such as size or area, number of bedrooms, location, the price of
other houses, and many others. Real estate investors want to know the actual value of a house in
order to buy and sell properties. They lose money when they pay more than the current market
value or sell for less than it. Banks also want to know the
current market price of a house when they accept it as collateral for a loan. Sometimes a loan
applicant overvalues their house to borrow the maximum possible loan from the bank. Local home buyers can also predict the
price of a house to find out whether a seller is asking too much, and local sellers can predict their house price to
find out what a fair market price is.
The purpose of the project is to gain a better understanding of the algorithms learnt in the academic lectures by
applying them to a specific problem and checking whether the results match expectations. The main
goal of the project is to predict the price of residential homes.
Q 2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required).
Answer:
Dimensions of Data:
From the above we can see the different columns we have in dataset.
1. cid: Notation (identifier) for a house. It is not useful for prediction, so we will drop this column.
2. dayhours: Represents the date when the house was sold.
3. price: It's our TARGET feature, which we have to predict based on the other features
4. room_bed: Represents number of bedrooms in a house
5. room_bath: Represents number of bathrooms
6. living_measure: Represents square footage of house
7. lot_measure: Represents square footage of lot
8. ceil: Represents number of floors in house
9. coast: Represents whether the house has a waterfront view. It seems to be a categorical
variable. We will confirm this in our further data analysis.
10. sight: Represents how many times the house has been viewed.
11. condition: Represents the overall condition of the house. It's a kind of rating given to the
house.
12. quality: Represents grade given to the house based on grading system
13. ceil_measure: Represents square footage of house apart from basement
14. basement: Represents square footage of basement
15. yr_built: Represents the year when house was built
16. yr_renovated: Represents the year when house was last renovated
17. zipcode: Represents zipcode as name implies
18. lat: Represents Latitude coordinates
19. long: Represents Longitude coordinates
20. living_measure15: Represents square footage of house as measured in 2015,
since the house area may or may not have changed after a renovation, if any happened
21. lot_measure15: Represents square footage of lot as measured in 2015, since the lot
area may or may not have changed after a renovation, if any was done
22. furnished: Tells whether the house is furnished or not. It seems to be a categorical variable, as
the description implies
23. total_area: Represents total area, i.e. the area of both living and lot
Next, we check for duplicate rows and find that there are none.
No. of duplicates : 0
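The duplicate check above can be sketched with pandas as follows. This is a minimal illustration on a made-up three-row frame, not the report's actual code; the variable name `n_duplicates` is an assumption.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
df = pd.DataFrame({
    "cid": [101, 102, 103],
    "price": [300000, 450000, 520000],
    "room_bed": [3, 4, 3],
})

# duplicated() flags rows that repeat an earlier row across all columns.
n_duplicates = int(df.duplicated().sum())
print("No. of duplicates :", n_duplicates)
```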
We will sort the columns by their number of missing values so that we can easily see which variables have missing values.
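One possible way to produce such an ordering is to count nulls per column and sort the counts. The sketch below uses invented values; only the column names come from the report.

```python
import pandas as pd

# Toy data with a few missing entries (None becomes NaN in float columns).
df = pd.DataFrame({
    "price": [300000.0, 450000.0, 520000.0, 610000.0],
    "room_bed": [3.0, None, 3.0, None],
    "coast": [0.0, 0.0, None, 1.0],
})

# Count missing values per column, largest count first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```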
While studying the data we found special characters such as $ in some fields, so we
will replace them with null values and treat the null values later.
The variables with null values are a mix of categorical and numerical; in each of them we will replace the nulls
with the mode of the column.
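The two steps described above (turning the special character into nulls, then filling nulls with the mode) might look like the following sketch. The sample values are invented, and looping over every column is an assumption about how the imputation was applied.

```python
import numpy as np
import pandas as pd

# Toy columns read as text because of the stray "$" entries (made-up values).
df = pd.DataFrame({
    "room_bed": ["3", "4", "$", "3"],
    "coast": ["0", "$", "0", "1"],
})

# Step 1: treat the special character as a missing value.
df = df.replace("$", np.nan)

# Step 2: fill each column's missing values with its most frequent value.
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```

Mode imputation keeps every filled value inside the set of values already observed, which suits categorical columns; for strongly skewed numerical columns the median is a common alternative.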
When we manually checked the data, we found a record where
room_bed is 33.
13. ceil_measure: Square footage of house apart from basement ranges from 290 to 9,410. As
Mean > Median, it's Right-Skewed.
14. basement: Square footage of house basement ranges from 0 to 4,820. As Mean is much greater than
Median, it's Highly Right-Skewed.
15. yr_built: House built year ranges from 1900 to 2015. As Mean < Median, it's Left-Skewed.
16. yr_renovated: House renovation year only 2015. So this column can be used
as a Categorical Variable for knowing whether the house is renovated or not.
17. zipcode: House ZipCode ranges from 98001 to 98199. As Mean > Median, it's Right-Skewed.
18. lat: Latitude ranges from 47.1559 to 47.7776. As Mean < Median, it's Left-Skewed.
19. long: Longitude ranges from -122.5190 to -121.315. As Mean > Median, it's Right-Skewed.
20. living_measure15: Value ranges from 399 to 6,210. As Mean > Median, it's Right-Skewed.
21. lot_measure15: Value ranges from 651 to 8,71,200. As Mean is much greater than Median,
it's Highly Right-Skewed.
22. furnished: Represents whether the house is furnished or not. It's a Categorical Variable.
23. total_area: Total area of house ranges from 1,423 to 16,52,659. As Mean is almost
double the Median, it's Highly Right-Skewed.
Most columns have a Right-Skewed distribution and only a few features are Left-Skewed (like
room_bath, yr_built, lat).
The columns that are Categorical in nature are: coast, yr_renovated, furnished.
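The mean-versus-median comparison used in the list above can be written as a small helper. This is a rough heuristic rather than a formal test, so the sketch also prints the sample skewness from pandas. The function name `skew_direction` and the five sample values are assumptions for illustration.

```python
import pandas as pd

def skew_direction(values: pd.Series) -> str:
    """Label a column by comparing its mean with its median (a rough heuristic)."""
    mean, median = values.mean(), values.median()
    if mean > median:
        return "Right-Skewed"
    if mean < median:
        return "Left-Skewed"
    return "Symmetric"

# Made-up sample with one very large value pulling the mean upwards.
ceil_measure = pd.Series([290, 1200, 1560, 2210, 9410])
label = skew_direction(ceil_measure)
skewness = ceil_measure.skew()  # positive values indicate a longer right tail
print(label, round(skewness, 2))
```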
We can see that a lot of features have outliers, so we might need to treat them before
building the model.
cid - Some CIDs appear multiple times, which suggests the data contains houses that were sold more than once.
We have 176 properties that were sold more than once in the given data.
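Counting repeat sales can be done by tallying how often each cid occurs. The sketch below uses invented ids and is only one way the figure could have been obtained.

```python
import pandas as pd

# Toy sales records: cid 2 was sold twice and cid 3 three times (made-up values).
df = pd.DataFrame({
    "cid": [1, 2, 2, 3, 3, 3, 4],
    "price": [300000, 410000, 450000, 500000, 520000, 560000, 380000],
})

# Number of sale records per house, then how many houses have more than one.
sales_per_house = df["cid"].value_counts()
n_resold = int((sales_per_house > 1).sum())
print("Properties sold more than once:", n_resold)
```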
So the timeline of the sale data of the properties is from May-2014 to May-2015, and April
has the highest mean price.
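A monthly mean price like the one described above can be derived by parsing the sale date and grouping by month. The timestamp format `YYYYMMDDTHHMMSS` used here is an assumption about how dayhours is stored, and the rows are invented.

```python
import pandas as pd

# Made-up sales; the dayhours string format is assumed, not taken from the report.
df = pd.DataFrame({
    "dayhours": ["20140502T000000", "20140915T000000",
                 "20150410T000000", "20150422T000000"],
    "price": [400000, 350000, 600000, 500000],
})

# Parse the sale date and average the price per calendar month.
df["sold_on"] = pd.to_datetime(df["dayhours"], format="%Y%m%dT%H%M%S")
monthly_mean = df.groupby(df["sold_on"].dt.month)["price"].mean()
best_month = int(monthly_mean.idxmax())
print(monthly_mean)
```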
The value of 33 seems to be an outlier; we need to check the data point before imputing
it.
We will delete this data point after the bivariate analysis, as it looks like an outlier: its price is low for
a 33-bedroom property.
Skewness is : 0.5051364190311984
There are many outliers in living_measure; they need further review before treatment.
Checking the number of data points with living_measure greater than 8000,
we find only 9 properties which have more than 8k living_measure, so we will treat these
outliers.
We checked the number of data points with lot_measure greater than 12,50,000.
Only 1 property has more than 12,50,000 lot_measure, so we need to treat it.
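Threshold counts of this kind reduce to a boolean comparison followed by a sum. The sketch below applies the two thresholds mentioned in the text (8000 for living_measure and 1250000 for lot_measure) to invented rows.

```python
import pandas as pd

# Made-up rows; only the column names and thresholds come from the report.
df = pd.DataFrame({
    "living_measure": [1500, 2400, 8200, 9100, 3000],
    "lot_measure": [5000, 7200, 20000, 1300000, 9000],
})

# A comparison yields True/False per row; summing counts the True values.
n_large_living = int((df["living_measure"] > 8000).sum())
n_large_lot = int((df["lot_measure"] > 1250000).sum())
print(n_large_living, n_large_lot)
```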
The above graph confirms the same: most properties have 1 or 2 floors.
There are only 13 properties which have the highest quality rating.
The vertical lines at each point represent the interquartile range of values at that point.
We can see 2 Gaussians, which tells us that some properties don't have basements and
some do.
Houses with a basement measure of zero do not have basements.
We can clearly see that there are outliers. We need to treat them before building our model.
Now we will check the number of data points with 'basement' greater than 4000.
The built year of the properties ranges from 1900 to 2014, and we can see an upward trend with time.
Most properties are not furnished. The furnished column needs to be converted into a categorical column.
BIVARIATE ANALYSIS
PairPlot: We have plotted all the variables and confirmed our above
deductions with more confidence.
1. price: The price distribution is Right-Skewed, as we deduced earlier from our 5-factor analysis.
2. room_bed: The plot of our target variable (price) against room_bed is not linear. Its distribution
has many Gaussians.
3. room_bath: Its plot against price shows a somewhat linear relationship. The distribution has a
number of Gaussians.
4. living_measure: The plot against price shows a strong linear relationship. It also has a linear
relationship with the room_bath variable, so we might remove one of these two. The distribution is
Right-Skewed.
5. lot_measure: No clear relationship with price.
6. ceil: No clear relationship with price. It has only 6 unique values.
Therefore, we can convert this column into a categorical column.
7. coast: No clear relationship with price. Clearly it's a categorical variable with 2 unique
values.
8. sight: No clear relationship with price. This has 5 unique values. Can be converted
to a Categorical variable.
9. condition: No clear relationship with price. This has 5 unique values. Can
be converted to a Categorical variable.
10. quality: Somewhat linear relationship with price. Has discrete values from 1 to 13.
Can be converted to a Categorical variable.
11. ceil_measure: Strong linear relationship with price, and also with the room_bath and
living_measure features. Distribution is Right-Skewed.
12. basement: No clear relationship with price.
13. yr_built: No clear relationship with price.
14. yr_renovated: No clear relationship with price. Has 2 unique values. Can be
converted to a Categorical Variable which tells whether the house is renovated or not.
15. zipcode, lat, long: No clear relationship with price or any other feature.
16. living_measure15: Somewhat linear relationship with the target feature. It behaves the same as
living_measure. Therefore we can drop this variable.
17. lot_measure15: No clear relationship with price or any other feature.
18. furnished: No clear relationship with price or any other feature. It has 2 unique values, so it
can be converted to a Categorical Variable.
19. total_area: No clear relationship with price. But it has a Very Strong linear
relationship with lot_measure, so one of them can be dropped.
We have plotted a heatmap, which confirms our findings above.
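The numbers behind such a heatmap are the pairwise Pearson correlations. The sketch below computes them on invented rows and shows why total_area and lot_measure end up almost perfectly correlated when total_area is the sum of living and lot area, as the attribute list describes. The plotting step itself is omitted.

```python
import pandas as pd

# Made-up rows; total_area is built as living + lot, following the attribute list.
df = pd.DataFrame({
    "living_measure": [1000, 1500, 2000, 2500, 3000],
    "lot_measure": [5000, 4000, 9000, 7000, 12000],
    "price": [300000, 420000, 510000, 640000, 720000],
})
df["total_area"] = df["living_measure"] + df["lot_measure"]

# Pairwise Pearson correlations; this matrix is what a heatmap visualises.
corr = df.corr()
print(corr.round(2))
```

Because the lot is usually much larger than the living area, total_area carries almost the same information as lot_measure, which is the reason for dropping one of the two.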
The mean price of houses tends to be higher during March, April and May than during the
September to December period.
room_bed - Outliers can be seen easily. The mean and median price increase with the number of
bedrooms per house up to a point and then drop.
room_bath - Outliers can be seen easily. Overall, the mean and median price increase with increasing
room_bath.
There is a clear increase in the price of the property with increasing living_measure, but there
seems to be one outlier to this trend, which needs to be evaluated.
lot_measure - The range of values is very large, so we break it into intervals to get a better view.
Almost 95% of the houses have a lot_measure below 25000, but there is no clear trend between
lot_measure and price.
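One way to break a wide-ranging column into intervals and compare prices across them is `pd.cut`. The band edges and labels below are assumptions chosen for illustration (the report only mentions the 25000 cut-off), and the rows are invented.

```python
import pandas as pd

# Made-up rows with one very large lot to mimic the wide value range.
df = pd.DataFrame({
    "lot_measure": [3000, 6000, 8000, 12000, 20000, 24000, 40000, 900000],
    "price": [350000, 420000, 300000, 510000, 480000, 390000, 450000, 610000],
})

# Share of houses below the 25000 cut-off mentioned in the text.
share_small = (df["lot_measure"] < 25000).mean()

# Hypothetical bands; each interval includes its upper edge.
df["lot_band"] = pd.cut(
    df["lot_measure"],
    bins=[0, 10000, 25000, float("inf")],
    labels=["up to 10k", "10k to 25k", "above 25k"],
)
median_by_band = df.groupby("lot_band", observed=True)["price"].median()
print(share_small, median_by_band, sep="\n")
```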
Properties with a waterfront tend to have a higher price than non-waterfront
properties.
sight - Has outliers. Houses that have been viewed more have a higher price (mean and median) and a larger
living area as well.
Properties with a higher price have a higher number of sights than houses with a lower price.
The above graph also supports this.
The price of the house increases with the condition rating of the house.
condition - Viewed in relation to price and living_measure. Most houses are rated as 3 or more.
We can see some outliers as well.
So we found that smaller houses are in better condition and that houses in better condition
have higher prices.
There is a clear increase in the price of the house with a higher quality rating.
quality - Viewed in relation to price and living_measure. Most houses are graded as 6 or more. We
can see some outliers as well.
We will create a categorical variable 'has_basement' that separates houses with a basement
from houses without one. This categorical variable will be used for further analysis.
Houses with a basement have a higher price than houses without a basement.
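A flag like 'has_basement' can be derived directly from the basement square footage, since a value of zero means no basement. The sketch below uses invented rows and encodes the flag as 0/1, which is an assumption about the encoding.

```python
import pandas as pd

# Made-up rows: a basement area of zero means the house has no basement.
df = pd.DataFrame({
    "basement": [0, 0, 400, 1200, 0],
    "price": [300000, 350000, 500000, 800000, 400000],
})

# 1 when the house has any basement area, 0 otherwise.
df["has_basement"] = (df["basement"] > 0).astype(int)

# Compare mean price between the two groups.
mean_price = df.groupby("has_basement")["price"].mean()
print(mean_price)
```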
We will create a new variable, HouseLandRatio: the proportion of living area in the total area of
the house. We will explore the trend of price against HouseLandRatio.
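Following the definition above, the ratio is living_measure divided by total_area. The sketch below assumes total_area is the sum of living and lot area, as stated in the attribute list, and uses invented rows.

```python
import pandas as pd

# Made-up rows; total_area = living_measure + lot_measure per the attribute list.
df = pd.DataFrame({
    "living_measure": [1000, 2000],
    "lot_measure": [4000, 6000],
})
df["total_area"] = df["living_measure"] + df["lot_measure"]

# Proportion of the total area that is living space (between 0 and 1).
df["HouseLandRatio"] = df["living_measure"] / df["total_area"]
print(df)
```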
So most houses were renovated after the 1980s. We will create a new categorical variable
'has_renovated' to categorize each property as renovated or non-renovated. For further
analysis we will use this categorical variable.
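The 'has_renovated' flag can be built the same way as a threshold test on yr_renovated. The sketch assumes that a value of 0 in yr_renovated marks a house that was never renovated, which the report does not state explicitly; the rows are invented.

```python
import pandas as pd

# Made-up rows; 0 is assumed to mean "never renovated".
df = pd.DataFrame({
    "yr_renovated": [0, 1995, 0, 2015],
    "price": [300000, 520000, 410000, 700000],
})

# 1 when a renovation year is recorded, 0 otherwise.
df["has_renovated"] = (df["yr_renovated"] > 0).astype(int)
print(df)
```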
has_renovated - Renovated houses have a higher mean and median price; however, it does not confirm if the prices
of house renovated
HouseLandRatio - Renovated houses utilize more of the land area for the construction of the house.
Renovated properties have a higher price than others with the same living measure space.
Furnished houses have a higher price than non-furnished houses.
DATA PROCESSING
Treating Outliers
ceil_measure
After treating the outliers of ceil_measure, the data has been reduced by about 600 (~3%) data points, but
the data is cleaner.
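The report does not say which rule was used to identify outliers. A common choice, shown here purely as an assumption, is to drop rows that fall outside 1.5 times the interquartile range (IQR) from the quartiles. The helper name `drop_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Made-up sample with one extreme value.
df = pd.DataFrame({"ceil_measure": [800, 1200, 1500, 1800, 2100, 2400, 9410]})
cleaned = drop_iqr_outliers(df, "ceil_measure")
print(len(df), "->", len(cleaned))
```

Dropping rows shrinks the data set, so the number of removed records is worth tracking per column, as the report does.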
We found some records which are outliers. Let's drop these outlier records.
The lot_measure column has 2124 outlier data points in total. We are still going ahead with imputing
the data, and will analyze later whether this has any impact on the data set.
As we know from our earlier findings that room_bed = 33 is an outlier, let's look at the record and drop it.
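Inspecting and dropping that single record could look like the sketch below. The rows are invented; only the column name and the value 33 come from the report.

```python
import pandas as pd

# Made-up rows including one record with an implausible bedroom count.
df = pd.DataFrame({
    "room_bed": [3, 4, 33, 2],
    "price": [350000, 520000, 640000, 280000],
})

# Look at the suspicious record first, then keep everything else.
suspect = df[df["room_bed"] == 33]
cleaned = df[df["room_bed"] != 33].reset_index(drop=True)
print(suspect)
```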
In summary, after treating outliers, we have lost about 15% of the data. We will analyse the
impact of this data loss during model evaluation.
Answer:
The data provided was a mix of numerical and categorical variables, so it was necessary to take that into
consideration.
We observed the presence of outliers, which showed that the data was unbalanced, but we have
treated the outliers as well as the missing values.
Some of the errors appear to be data entry errors;
sometimes a business could incur a loss due to such mistakes.
We can use a pincode data set to get an idea of the cities and the county each property belongs to.