PN1 Shakti Akshaya S PDF

The report analyzes housing data to predict house prices based on various attributes like size, location, number of bedrooms etc. Outliers were identified and removed to clean the data and make it suitable for analysis and modeling.

The business problem is predicting the right price of houses for sale based on their attributes to avoid pricing them too high or too low. This is important for sellers, buyers, banks and real estate investors.

Outliers were identified and removed from several numerical fields such as living space, lot size, above-ground area and basement area to bring the distributions closer to normal. This was done to clean the data, as outliers can influence results. About 15% of the data was removed during this process.

House Price Prediction

BUSINESS REPORT

CAPSTONE PROJECT

HOUSE PRICE PREDICTION

Great Learning – PGP- Data Science and Business Analytics

- Shakti Akshaya S


Table of contents:

Problem Statement
Introduction of Business Problem
Data Report
Column Information
Data Types
Duplicates and Missing Values Check
Summary of data
Exploratory Data Analysis
Univariate Analysis
Bivariate analysis
Data Processing
Outlier Treatment
Business Insights


Problem statement:
A house's value is more than just its location and square footage. Like the features that make up a
person, an educated party would want to know all the aspects that give a house its value. For
example, suppose you want to sell a house but do not know what price to expect; it can't
be too low or too high. To find a house's price you usually look for similar properties in your
neighborhood and, based on the gathered data, try to assess the price of your own house.

1) Introduction of the business problem

a) Defining problem statement


When a person or business wants to sell or buy a house, they often do not know what price to offer, so they may
offer too little or too much for the property. We can therefore analyze the available data on properties in the
area and predict the price. We need to find how the attributes influence house prices, because the right pricing
is a very important part of selling a house. It is important to understand which factors matter and how they
influence the price. The objective is to predict the right price of a house based on its attributes.

The price of a house depends on various factors such as size or area, the number of bedrooms, location, the price
of other houses, and many others. Real estate investors want to know the actual value of a house in order to buy
and sell properties: they lose money when they pay more than the current market value or sell for less than it.
Banks also want to know the current market price of a house when it is used as collateral for a loan, since loan
applicants sometimes overvalue their house to borrow the maximum amount from the bank. Local home buyers can
predict the price of a house to find out whether a seller is asking too much, and local sellers can predict their
own house price to find out what a fair market price is.

b) Need of the study/project


To analyse and predict the price of a house using the feature variables given in the dataset.

The purpose of the project is to gain a better understanding of the algorithms learnt in the academic lectures by
applying them to a specific problem and checking whether the results match expectations. The main
goal of the project is to predict the price of residential homes.

c) Understanding business/social opportunity


As people do not know the features/aspects that make up a property's price, we can provide house
buying and selling guidance services in the area, so that they can buy or sell their property at the
most suitable price and do not lose their hard-earned money by offering too low a price, or keep
waiting for buyers after asking too high a price.


Q 2. Data Report
a) Understanding how data was collected in terms of time, frequency and methodology
b) Visual inspection of data (rows, columns, descriptive details)
c) Understanding of attributes (variable info, renaming if required).

Answer:
Dimensions of Data:

Number of records and features/aspects we have in the data: (21613, 23)

We have 21613 rows and 23 columns.

Let's check out the columns/features we have in the dataset:

From the above we can see the different columns we have in dataset.

These columns provide below information

1. cid: Notation for a house. It is of no use for prediction, so we will drop this column.
2. dayhours: Represents the date when the house was sold.
3. price: Our TARGET feature, which we have to predict based on the other features.
4. room_bed: Represents the number of bedrooms in a house.
5. room_bath: Represents the number of bathrooms.
6. living_measure: Represents the square footage of the house.
7. lot_measure: Represents the square footage of the lot.
8. ceil: Represents the number of floors in the house.
9. coast: Represents whether the house has a waterfront view. It seems to be a categorical
variable; we will check this in our further data analysis.
10. sight: Represents how many times the house has been viewed.
11. condition: Represents the overall condition of the house. It is a kind of rating given to the
house.
12. quality: Represents the grade given to the house based on a grading system.
13. ceil_measure: Represents the square footage of the house apart from the basement.
14. basement: Represents the square footage of the basement.
15. yr_built: Represents the year when the house was built.
16. yr_renovated: Represents the year when the house was last renovated.
17. zipcode: Represents the zipcode, as the name implies.
18. lat: Represents latitude coordinates.
19. long: Represents longitude coordinates.
20. living_measure15: Represents the square footage of the house as measured in 2015,
since the house area may or may not have changed after a renovation, if any happened.
21. lot_measure15: Represents the square footage of the lot as measured in 2015, since the lot
area may or may not have changed after a renovation, if any was done.
22. furnished: Tells whether the house is furnished or not. It seems to be a categorical variable, as
the description implies.
23. total_area: Represents the total area, i.e. the area of both living and lot.

let's see the data types of the features:
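A minimal sketch of how the dimensions, column names and data types could be inspected with pandas. The three-row DataFrame below is a hypothetical stand-in for the real data, and the sample values (including the text format of dayhours) are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical three-row sample; the real data has 21613 rows and 23 columns.
df = pd.DataFrame({
    "cid": [1001, 1002, 1003],
    "dayhours": ["20150427T000000", "20140613T000000", "20141107T000000"],
    "price": [600000.0, 190000.0, 735000.0],
    "room_bed": [4.0, 2.0, 4.0],
})

n_rows, n_cols = df.shape      # (rows, columns)
print(n_rows, n_cols)
print(list(df.columns))        # column names
print(df.dtypes)               # data type of each column
```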


Now we check for duplicate values and find that there are none.

No. of duplicates : 0

Now we will check if we have missing values:

We have 689 missing values.

We will arrange the counts in order so that we can easily see which variables have missing values.
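The duplicate and missing-value checks can be sketched as follows. The four-row sample is hypothetical, so its counts differ from the 0 duplicates and 689 missing values reported for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few gaps; not the report's data.
df = pd.DataFrame({
    "room_bed": [3.0, np.nan, 4.0, 2.0],
    "room_bath": [2.0, 1.0, np.nan, 1.5],
    "living_measure": [1800.0, np.nan, 2500.0, 900.0],
    "price": [450000, 300000, 700000, 250000],
})

n_duplicates = int(df.duplicated().sum())    # rows identical to an earlier row
missing_per_column = df.isnull().sum().sort_values(ascending=False)
total_missing = int(missing_per_column.sum())

print("No. of duplicates :", n_duplicates)
print(missing_per_column[missing_per_column > 0])
print("Total missing values:", total_missing)
```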


living_measure15, room_bed, room_bath, sight, lot_measure, ceil, lot_measure15, total_area,
living_measure, ceil_measure, basement and yr_built are all numerical variables, so we will
impute their missing values with the mode.

When studying the data we saw that some values contain special characters such as $, so we
will replace them with null values and treat the null values later.

After replacing the characters we have a total of 864 null values.

We now have a mix of categorical and numerical variables with null values; we will replace them
with the mode.
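One way the two steps above could look in pandas is sketched below. The sample values and the choice of columns that contain "$" are assumptions, not taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "$" stands in for the special characters found in the data.
df = pd.DataFrame({
    "ceil": ["1", "2", "$", "1", "1"],
    "coast": ["0", "0", "0", "$", "1"],
    "room_bed": [3.0, np.nan, 3.0, 4.0, 2.0],
})

df = df.replace("$", np.nan)                 # special character -> null

for col in df.columns:
    most_frequent = df[col].mode().iloc[0]   # mode() skips nulls by default
    df[col] = df[col].fillna(most_frequent)

# The columns that held "$" were read as text, so convert them back to numbers.
df["ceil"] = pd.to_numeric(df["ceil"])
df["coast"] = pd.to_numeric(df["coast"])
```

When a column has several equally frequent values, mode() returns all of them in sorted order and the sketch takes the first one.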


When we manually checked the data we saw a record where room_bed is 33.

On analysis, this appears to be a data entry mistake.

So we will treat it after the univariate and bivariate analysis.

Now we will do the 5-factor analysis of the features.

1. CID: House ID/Property ID. Not used for analysis.
2. Dayhours: 5-factor analysis is reflecting for this column.
3. price: Our target column value is in the 75k - 7700k range. As Mean > Median, it's Right-Skewed.
4. room_bed: Number of bedrooms ranges from 0 - 11. As Mean slightly > Median,
it's slightly Right-Skewed.
5. room_bath: Number of bathrooms ranges from 0 - 8. As Mean slightly < Median,
it's slightly Left-Skewed.
6. living_measure: Square footage of house ranges from 290 - 13,540. As Mean > Median,
it's Right-Skewed.
7. lot_measure: Square footage of lot ranges from 520 - 16,51,359. As Mean is almost double
the Median, it's Highly Right-Skewed.
8. ceil: Number of floors ranges from 1 - 3.5. As Mean ~ Median, it's almost Normally
Distributed.
9. coast: This value represents whether the house has a waterfront view or not. It's a categorical
column. From the above analysis we learned that very few houses have a waterfront view.
10. sight: Value ranges from 0 - 4. As Mean > Median, it's Right-Skewed.
11. condition: Represents the rating of the house, which ranges from 1 - 5. As Mean > Median,
it's Right-Skewed.
12. quality: Represents the grade given to the house, which ranges from 1 - 13. As Mean > Median,
it's Right-Skewed.
13. ceil_measure: Square footage of house apart from basement ranges from 290 - 9,410. As
Mean > Median, it's Right-Skewed.
14. basement: Square footage of the house basement ranges from 0 - 4,820. As Mean is much greater
than Median, it's Highly Right-Skewed.
15. yr_built: House built year ranges from 1900 - 2015. As Mean < Median, it's Left-Skewed.
16. yr_renovated: House renovation year only 2015. So this column can be used
as a Categorical Variable for knowing whether a house is renovated or not.
17. zipcode: House ZipCode ranges from 98001 - 98199. As Mean > Median, it's Right-Skewed.
18. lat: Latitude ranges from 47.1559 - 47.7776. As Mean < Median, it's Left-Skewed.
19. long: Longitude ranges from -122.5190 to -121.315. As Mean > Median, it's Right-Skewed.
20. living_measure15: Value ranges from 399 to 6,210. As Mean > Median, it's Right-Skewed.
21. lot_measure15: Value ranges from 651 to 8,71,200. As Mean is much greater than Median,
it's Highly Right-Skewed.
22. furnished: Represents whether the house is furnished or not. It's a Categorical Variable.
23. total_area: Total area of house ranges from 1,423 to 16,52,659. As Mean is almost
double the Median, it's Highly Right-Skewed.

From the above analysis we learned:

The distribution of most columns is Right-Skewed and only a few features are Left-Skewed (like
room_bath, yr_built, lat).

The columns that are Categorical in nature are: coast, yr_renovated, furnished
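The mean-versus-median reading used in this summary can be sketched with pandas as below. The values are hypothetical, and the skew_hint column is a name introduced here for illustration. Comparing mean and median is only a rule of thumb, so the sketch also computes the sample skewness as a cross-check.

```python
import pandas as pd

# Hypothetical values, chosen so that one column has a long right tail.
df = pd.DataFrame({
    "lot_measure": [4000, 5000, 6000, 7500, 9000, 250000],
    "yr_built": [1905, 1990, 2001, 2005, 2010, 2014],
})

summary = df.describe().T            # count, mean, std, min, 25%, 50%, 75%, max
summary["skew_hint"] = "roughly symmetric"
summary.loc[summary["mean"] > summary["50%"], "skew_hint"] = "right-skewed"
summary.loc[summary["mean"] < summary["50%"], "skew_hint"] = "left-skewed"
summary["skewness"] = df.skew()      # sample skewness as a cross-check

print(summary[["min", "50%", "mean", "max", "skew_hint", "skewness"]])
```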


Q 3. Exploratory Data Analysis


a) Univariate analysis (distribution and spread for every continuous attribute, distribution of
data in categories for categorical ones)
b) Bivariate analysis (relationship between different variables, correlations)
a) Removal of unwanted variables (if applicable)
b) Missing Value treatment (if applicable)
d) Outlier treatment (if required)
e) Variable transformation (if applicable)
f) Addition of new variables (if required)

Exploratory Data Analysis

Let's do some visual data analysis of the features

Univariate Analysis - By BoxPlot


We can see that a lot of features have outliers, so we may need to treat those before
building the model.

Analyzing Feature: cid

cid - some CID values appear multiple times, so the data seems to contain houses that were sold more than once.

We have 176 properties that were sold more than once in the given data.
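Counting repeated properties could be done along these lines; the ids below are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical ids; "cid" 101 and 103 appear more than once.
df = pd.DataFrame({"cid": [101, 102, 101, 103, 103, 103, 104]})

sales_per_property = df["cid"].value_counts()
resold = sales_per_property[sales_per_property > 1]

print("Properties sold more than once:", len(resold))
```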

Analyzing Feature: dayhours


We created a new data frame that can be used for modeling.

We converted dayhours to 'month_year', as the month and year of sale are relevant for analysis.
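A sketch of the conversion, assuming dayhours is stored as text such as "20150427T000000" (the report does not show the raw format, so this is an assumption). The month_year label is written as year-month here so that it sorts chronologically.

```python
import pandas as pd

# Assumption: "dayhours" is text such as "20150427T000000".
df = pd.DataFrame({
    "dayhours": ["20150427T000000", "20140613T000000", "20150402T000000"],
    "price": [600000, 190000, 735000],
})

sold_on = pd.to_datetime(df["dayhours"], format="%Y%m%dT%H%M%S")
df["month_year"] = sold_on.dt.strftime("%Y-%m")    # e.g. "2015-04"

sales_per_month = df["month_year"].value_counts().sort_index()
mean_price_per_month = df.groupby("month_year")["price"].mean()
print(sales_per_month)
print(mean_price_per_month)
```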


We can see that most houses were sold in April and July.

The sale data covers the period from May-2014 to May-2015, and April has the highest
mean price.

Analyzing Feature: Price (our Target)


The price ranges from 75,000 to 77,00,000 and its distribution is right-skewed.

Analyzing Feature: room_bed


The value of 33 seems to be an outlier; we need to check this data point before treating it.

We will delete this data point after the bivariate analysis, as it looks like an outlier: the price is
low for a 33-bedroom property.

Most of the houses/properties have 3 or 4 bedrooms.

Analyzing Feature: room_bath


The majority of the properties have between 1.0 and 2.5 bathrooms.

Skewness is : 0.5051364190311984

Analyzing Feature: Living measure


Data distribution tells us, living_measure is right-skewed.

Plotting the boxplot for living_measure


There are many outliers in living_measure. We need to review them further before treating them.

We will check the number of data points with living_measure greater than 8000.

Only 9 properties/houses have a living_measure of more than 8k, so we will treat these
outliers.
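Threshold checks of this kind (here for living_measure, and later for lot_measure and basement) can be sketched with a small helper. The function name count_above and the sample values are hypothetical.

```python
import pandas as pd

def count_above(data, column, threshold):
    """Number of rows whose `column` value is strictly greater than `threshold`."""
    return int((data[column] > threshold).sum())

# Hypothetical values, not the report's data.
df = pd.DataFrame({
    "living_measure": [1200, 2400, 8500, 3100, 13540, 7900],
    "lot_measure": [5000, 7200, 9800, 1651359, 4300, 6100],
})

n_large_living = count_above(df, "living_measure", 8000)
n_large_lot = count_above(df, "lot_measure", 1250000)
print(n_large_living, n_large_lot)
```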

Analyzing Feature: lot_measure

Data is skewed as visible from plot.


We checked the number of data points with lot_measure greater than 1250000.

Only 1 property has a lot_measure of more than 12,50,000, so we need to treat it.

Analyzing Feature: ceil


let's see the ceil count for all the records


We can see that most houses have 1 floor.

The above graph confirms this: most properties have 1 or 2 floors.

Analyzing Feature: coast


coast - most houses do not have a waterfront view; very few are waterfront

Analyzing Feature: sight


sight - most houses have not been viewed


Analyzing Feature: condition


condition - most houses are rated 3 or above for their overall condition

Analyzing Feature: quality


Quality - most properties have a quality rating between 6 and 10

Checking the number of data points with a quality rating of 13


There are only 13 properties which have the highest quality rating

Analyzing Feature: ceil_measure


ceil_measure - it is highly skewed


There is no pattern in Ceil Vs Ceil_measure

The vertical lines at each point represent the inter quartile range of values at that point

Analyzing Feature: basement

We can see 2 gaussians, which tells us that some properties do not have basements and
some do.

Almost 60% of the properties have no basement, i.e. their basement measure is zero.

So we have plotted the boxplot only for properties which have basements.
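The share of properties without a basement, and the subset used for the boxplot, could be computed as sketched below. The ten values are hypothetical and were chosen to give a 60% share.

```python
import pandas as pd

# Hypothetical values; a basement of 0 means the house has no basement.
df = pd.DataFrame({"basement": [0, 0, 0, 500, 1200, 0, 0, 800, 0, 300]})

share_without = (df["basement"] == 0).mean() * 100        # percentage of rows
with_basement = df.loc[df["basement"] > 0, "basement"]    # input for the boxplot

print(round(share_without, 1), "% without basement")
print(with_basement.describe())
```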


We can clearly see that there are outliers. We need to treat them before building our model.

Now we will check the number of data points with 'basement' greater than 4000.

We have only 2 properties with a basement of more than 4,000.

Distribution of houses having basement


The distribution of houses having a basement is right-skewed

Analyzing Feature: yr_built

Houses range from new to very old.

The built year of the properties ranges from 1900 to 2014 and we can see an upward trend with time


Analyzing Feature: yr_renovated

Only 914 houses were renovated out of 21613 houses

yr_renovated - plot of houses which are renovated

Analyzing Feature: furnished

Most properties are not furnished. The furnished column needs to be converted into a categorical column.


BIVARIATE ANALYSIS
PairPlot: We have plotted all the variables and confirmed our above
deductions with more confidence.

From the above pair plot, we observed/deduced the following:

1. price: The price distribution is Right-Skewed, as we deduced earlier from our 5-factor analysis.
2. room_bed: The plot of our target variable (price) against room_bed is not linear. Its distribution
has a lot of gaussians.
3. room_bath: Its plot with price shows a somewhat linear relationship. The distribution has a
number of gaussians.
4. living_measure: The plot against price shows a strong linear relationship. It also has a linear
relationship with the room_bath variable, so we might remove one of these two. Distribution is
Right-Skewed.
5. lot_measure: No clear relationship with price.
6. ceil: No clear relationship with price. It has only 6 unique values.
Therefore, we can convert this column into a categorical column.
7. coast: No clear relationship with price. It is clearly a categorical variable with 2 unique
values.
8. sight: No clear relationship with price. This has 5 unique values. Can be converted
to a Categorical variable.
9. condition: No clear relationship with price. This has 5 unique values. Can
be converted to a Categorical variable.
10. quality: Somewhat linear relationship with price. Has discrete values from 1 - 13.
Can be converted to a Categorical variable.
11. ceil_measure: Strong linear relationship with price. Also with the room_bath and
living_measure features. Distribution is Right-Skewed.
12. basement: No clear relationship with price.
13. yr_built: No clear relationship with price.
14. yr_renovated: No clear relationship with price. Has 2 unique values. Can be
converted to a Categorical Variable which tells whether the house is renovated or not.
15. zipcode, lat, long: No clear relationship with price or any other feature.
16. living_measure15: Somewhat linear relationship with the target feature. It carries the same
information as living_measure. Therefore we can drop this variable.
17. lot_measure15: No clear relationship with price or any other feature.
18. furnished: No clear relationship with price or any other feature. 2 unique values, so it
can be converted to a Categorical Variable.
19. total_area: No clear relationship with price. But it has a very strong linear
relationship with lot_measure. So one of them can be dropped.

In brief, the below features should be converted to Categorical Variables:

ceil, coast, sight, condition, quality, yr_renovated, furnished

And the below columns can be dropped after checking the Pearson factor:

zipcode, lat, long, living_measure15, lot_measure15, total_area

Now we will check the correlation between the different features.

Table in Jupyter notebook.
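Since the table itself is only in the notebook, here is a minimal sketch of how a Pearson correlation matrix could be produced with pandas. The five rows are hypothetical, so the coefficients say nothing about the real data.

```python
import pandas as pd

# Hypothetical values, not the report's data.
df = pd.DataFrame({
    "price": [250000, 320000, 410000, 560000, 700000],
    "living_measure": [900, 1400, 1800, 2600, 3300],
    "lot_measure": [9000, 4000, 7500, 5000, 8000],
})

corr = df.corr(method="pearson")     # pairwise Pearson coefficients
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Sorting the price column of the matrix gives a quick ranking of which features move most closely with the target.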


From the above matrix we learned that the following features have linear relationships:

1. price: room_bath, living_measure, quality, living_measure15, furnished
2. living_measure: price, room_bath. So we can consider dropping the 'room_bath' variable.
3. quality: price, room_bath, living_measure
4. ceil_measure: price, room_bath, living_measure, quality
5. living_measure15: price, living_measure, quality. So we can consider dropping
living_measure15 as well, as it gives the same information as living_measure.
6. lot_measure15: lot_measure. Therefore, we can consider dropping lot_measure15, as
it gives the same information.
7. furnished: quality
8. total_area: lot_measure, lot_measure15. Therefore, we can consider dropping the total_area
feature as well, as it gives the same information as lot_measure.


We have plotted heatmap and can easily confirm our above findings

Analyzing Bivariate for Feature: month_year


month_year - the month and year in which the house was sold. Price is not influenced by it, though there are
outliers that can easily be seen.


The mean price of the houses tends to be higher during March, April and May than during the
September, October, November and December period.

Analyzing Bivariate for Feature: room_bed

room_bed - outliers can be seen easily. The mean and median price increase with the number of
bedrooms per house up to a point and then drop.


There is a clear increasing trend in price with room_bed

room_bath - outliers can be seen easily. Overall, the mean and median price increase with increasing
room_bath

There is an upward trend in price with an increase in room_bath

Analyzing Bivariate for Feature: living_measure

living_measure - price increases with increase in living measure


There is a clear increase in the price of the property with an increase in living measure, but there
seems to be one outlier to this trend. We need to evaluate it.

Analyzing Bivariate for Feature: lot_measure


lot_measure - there seems to be no relation between lot_measure and price

lot_measure - the value range is very large, so we break it up to get a better view.


There does not seem to be any relation between lot_measure and the price trend

Almost 95% of the houses have a lot_measure below 25000, but there is no clear trend between
lot_measure and price


lot_measure >100000 - price increases with increase in living measure

Analyzing Bivariate for Feature: ceil


ceil - median price increases initially and then falls


There is some slight upward trend in price with the ceil.

Analyzing Bivariate for Feature: coast


coast - the mean and median price of waterfront houses is high; however, such houses are very few compared
to non-waterfront ones

Also, the living_measure mean and median are greater for waterfront houses.


Properties with a waterfront tend to have a higher price than non-waterfront
properties

Analyzing Bivariate for Feature: sight

sight - has outliers. Houses that were sighted more have a higher price (mean and median) and a larger
living area as well.


Properties with a higher price have more sights than houses with a lower price

Sight - Viewed in relation with price and living_measure

Costlier houses with large living area are sighted more.

The above graph also supports this finding.

Analyzing Bivariate for Feature: condition


condition - as the condition rating increases, the mean and median of price and living measure also
increase.


The price of the house increases with the condition rating of the house.
Condition - Viewed in relation with price and living_measure. Most houses are rated 3 or more.
We can see some outliers as well.


So we found that smaller houses are in better condition and that houses in better condition
have higher prices

Analyzing Bivariate for Feature: quality


quality - as the grade increases, price and living_measure increase (mean and median)

There is a clear increase in the price of the house with a higher quality rating.
quality - Viewed in relation with price and living_measure. Most houses are graded 6 or more. We
can see some outliers as well


Analyzing Bivariate for Feature: ceil_measure


ceil_measure - price increases with increase in ceil measure

There is upward trend in price with ceil_measure


Analyzing Bivariate for Feature: basement

We will create the categorical variable 'has_basement' to separate houses with a basement from houses
without one. This categorical variable will be used for further analysis.
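A sketch of how 'has_basement' could be derived and used to compare prices. It assumes a basement measure of 0 means no basement, and the four rows are hypothetical.

```python
import pandas as pd

# Hypothetical values; a basement of 0 is taken to mean "no basement".
df = pd.DataFrame({
    "basement": [0, 600, 0, 1200],
    "price": [300000, 520000, 280000, 750000],
})

df["has_basement"] = (df["basement"] > 0).astype(int)   # 1 = basement, 0 = none
price_by_basement = df.groupby("has_basement")["price"].agg(["mean", "median"])
print(price_by_basement)
```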

Binning Basement to analyse data


basement - after binning, the data shows that houses with a basement are costlier and have a higher
living measure (mean & median)

Houses with a basement have a higher price than houses without a basement

basement - have higher price & living measure


yr_built - outliers can be seen easily.

We will create a new variable, HouseLandRatio: the proportion of living area in the total area of
the property. We will explore the trend of price against this HouseLandRatio.

HouseLandRatio - Computed as the ratio living_measure/total_area

Signifies - Share of land used for construction of the house
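The new variable is a simple column ratio. In the sketch below the values are hypothetical, and the ratio is kept as a fraction between 0 and 1 rather than a percentage, which is an assumption since the text does not say which is used.

```python
import pandas as pd

# Hypothetical values, not the report's data.
df = pd.DataFrame({
    "living_measure": [2000, 1500, 3000],
    "total_area": [10000, 6000, 12000],
})

# Fraction of the total area that is living area.
df["HouseLandRatio"] = df["living_measure"] / df["total_area"]
print(df)
```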


Analyzing Bivariate for Feature: yr_renovated

So most renovated houses were renovated after the 1980s. We will create a new categorical variable
'has_renovated' to categorize each property as renovated or non-renovated. For further
analysis we will use this categorical variable.
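A sketch of the 'has_renovated' flag, assuming yr_renovated is 0 for houses that were never renovated (the report does not state the encoding). The sample rows are hypothetical.

```python
import pandas as pd

# Assumption: yr_renovated is 0 for houses that were never renovated.
df = pd.DataFrame({
    "yr_renovated": [0, 1995, 0, 2008, 0],
    "price": [300000, 540000, 350000, 620000, 280000],
})

df["has_renovated"] = (df["yr_renovated"] > 0).astype(int)   # 1 = renovated
price_by_renovation = df.groupby("has_renovated")["price"].agg(["mean", "median"])
print(price_by_renovation)
```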

Let's try to group yr_renovated

has_renovated - renovated houses have a higher mean and median price; however, this does not confirm
whether the prices of renovated houses actually increased or not.

HouseLandRatio - Renovated houses utilized more land area for construction of the house


Renovated properties have a higher price than others with the same living measure.

Analyzing Bivariate for Feature: furnished


furnished - Furnished houses have a higher price and a greater living_measure

Furnished houses have a higher price than non-furnished houses


DATA PROCESSING
Treating Outliers

We have seen outliers in the columns room_bed (the 33-bedroom record),
living_measure, lot_measure, ceil_measure and basement.

ceil_measure

After treating the outliers of ceil_measure, the data has reduced by about 600 (~3%) data points, but
the distribution looks cleaner.

Treating outliers for column - basement

We got 408 records as outliers; let's drop them.
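The report does not state which rule was used to flag outliers. A common choice is the 1.5 x IQR rule, sketched below as an assumption; the helper name drop_iqr_outliers and the sample values are hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(data, column, k=1.5):
    """Return rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return data[data[column].between(lower, upper)]

# Hypothetical values with one extreme basement size.
df = pd.DataFrame({"basement": [0, 0, 200, 400, 500, 600, 700, 800, 4820]})

cleaned = drop_iqr_outliers(df, "basement")
print(len(df) - len(cleaned), "outlier rows dropped")
```

The same helper could be applied to ceil_measure, living_measure and lot_measure in turn; because rows are dropped at each step, the quartiles of later columns are computed on the already reduced data.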


After treating the outliers of basement, about 400 (~2%) data points were dropped. In total, about
5% of the data has been dropped after treating ceil_measure and basement.

Let's see the boxplot now for basement

Treating outliers for column - living_measure

We got 178 records as outliers. Let's treat them by dropping them.

Let's see the boxplot after dropping the outliers


By treating the outliers of living_measure we lost 178 more data points,
and the data distribution looks normal.

Shape of the data after dropping outliers in living_measure: (20416, 27)

Treating outliers for column - lot_measure

We got some records which are outliers. Let's drop these outlier records.

Let's plot after treating outliers


The lot_measure column has 2124 outlier data points in total. We are still going ahead with treating
them, and will analyze later whether this has any impact on the data set.

Treating outliers for column - room_bed

As we know from our earlier findings that the record with room_bed = 33 is an outlier, let's look at the record and drop it.

Dropping the record from the dataset

In summary, after treating outliers, we have lost about 15% of the data. We will analyse the
impact of this data loss during the model evaluation.
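The 15% figure can be cross-checked from the counts quoted in this section. The sketch assumes the steps were applied in the order described, that all 2124 lot_measure outliers were dropped, and that the 33-bedroom record was still present at the last step.

```python
rows_start = 21613             # rows in the raw data
rows_after_living = 20416      # shape reported after the living_measure step
dropped_basement = 408
dropped_living = 178
dropped_lot = 2124
dropped_room_bed = 1           # the 33-bedroom record

# ceil_measure count implied by the reported shapes ("about 600" in the text)
dropped_ceil = rows_start - rows_after_living - dropped_basement - dropped_living

rows_final = rows_after_living - dropped_lot - dropped_room_bed
pct_lost = 100 * (rows_start - rows_final) / rows_start
print(dropped_ceil, rows_final, round(pct_lost, 1))
```

Under these assumptions the ceil_measure step accounts for 611 rows, 18291 rows remain, and the loss comes to roughly 15.4%, which agrees with the "about 15%" stated above.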

4. Business insights from EDA


a) Is the data unbalanced? If so, what can be done? Please explain in the context of the business
b) Any business insights using clustering (if applicable)
c) Any other business insights

Answer:

The data provided was a mix of numerical and categorical variables, so it was necessary to take that into
consideration.

We observed outliers, which indicated that the data was unbalanced, but we have
treated the outliers as well as the missing values.

Some values appeared to be data entry errors;
a business can sometimes make a loss because of such mistakes.

We can also use a pincode data set to get an idea of the city and county each property belongs to.
