Machine Learning: Assignment-1

The document discusses predicting housing prices using machine learning models. It uses the Boston housing dataset, which contains 13 explanatory variables related to housing in Boston. The dependent variable is price; the independent variables include crime rates, distances to employment centres, tax rates, and more. The goal is to create a regression model that accurately estimates housing prices based on these features.


MACHINE LEARNING

Assignment-1

MARCH 28, 2023


ISHA AGGARWAL
2022-2305-0001-0001
PGDM, SEM-II

Batch: 2022 -2024

Subject: Machine Learning-I

Housing prices are an important reflection of the economy, and housing price ranges are of great interest for both
buyers and sellers. House prices will be predicted given explanatory variables that cover many aspects of residential
houses. The goal of this case study is to create a regression model that is able to accurately estimate the price of the
house given the features.

Data Set: Boston Data set

Data description.

1. CRIM per capita crime rate by town

2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.

3. INDUS proportion of non-retail business acres per town

4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

5. NOX nitric oxides concentration (parts per 10 million)

6. RM average number of rooms per dwelling

7. AGE proportion of owner-occupied units built prior to 1940

8. DIS weighted distances to five Boston employment centers

9. RAD index of accessibility to radial highways

10. TAX full-value property-tax rate per 10,000 USD

11. PTRATIO pupil-teacher ratio by town

12. Black 1000(Bk − 0.63)² where Bk is the proportion of Black residents by town

13. LSTAT % lower status of the population

Below is the detailed analysis performed on the Boston data set. Observe the analysis and answer the questions given below:

Q.1) State the use of the following libraries. [5 Marks]

i) Pandas ii) numpy iii) seaborn iv) matplotlib v) sklearn

1) Pandas: Pandas is a Python library used for working with data. It has functions for analysing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis"; the library was created by Wes McKinney in 2008.
Use of Pandas:

 Pandas allows us to analyse big data and draw conclusions based on statistical theory.
 Pandas can clean messy data sets and make them readable and relevant.
 Relevant data is very important in data science.

Pandas answers questions about the data, such as:

 Is there a correlation between two or more columns?
 What is the average value?
 What is the maximum value?
 What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
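A minimal sketch of these operations, using a tiny made-up frame (the column names mirror the Boston variables, but the values are invented for illustration):

```python
import pandas as pd

# Tiny made-up frame; column names mirror the Boston variables,
# but the values are invented.
df = pd.DataFrame({
    "RM":    [6.5, 5.9, 7.1, 6.2, None],
    "LSTAT": [4.9, 9.1, 4.0, 7.3, 12.4],
    "PRICE": [24.0, 21.6, 34.7, 22.9, 18.2],
})

print(df["PRICE"].mean())                  # average value
print(df["PRICE"].max(), df["PRICE"].min())
print(df["RM"].corr(df["PRICE"]))          # correlation between two columns

clean = df.dropna()                        # drop rows with empty/NULL values
print(len(clean))                          # 4 rows remain after cleaning
```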
2) Numpy: NumPy is a general-purpose package for array processing. It offers a multidimensional array object with outstanding speed, as well as tools for working with these arrays. It is the cornerstone Python module for scientific computing. In addition to its obvious scientific uses, NumPy is a powerful multi-dimensional container of data.

NumPy's array is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. The rank of an array in NumPy is its number of dimensions, and the shape of the array is a tuple of integers giving the size of the array along each axis. The array class in NumPy is called ndarray. Elements of NumPy arrays are accessed with square brackets, and arrays can be initialised from nested Python lists.
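For example, the concepts above (building an ndarray from nested Python lists, and its rank, shape, and square-bracket indexing) look like this:

```python
import numpy as np

# Build a 2-D ndarray from nested Python lists.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # rank (number of dimensions): 2
print(a.shape)   # size along each axis: (2, 3)
print(a[1, 2])   # square-bracket element access: 6
```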

3) Seaborn: It is a Python module mostly used for statistical graphing. It is based on Matplotlib and
offers lovely default styles and colour palettes to enhance the aesthetic appeal of statistical displays.

With Seaborn, visualisation will be at the heart of data exploration and comprehension. For a better
comprehension of the dataset, it offers dataset-oriented APIs that allow us to switch between various
visual representations for the same variables.

Plots are primarily used to show how different variables relate to one another. These variables may be entirely numerical or may represent a category, such as a group, class, or division. Seaborn categorises its plots into the following groups:

 Relational plots: used to understand the relationship between two variables.
 Categorical plots: deal with categorical variables and how they can be visualised.
 Distribution plots: used for examining univariate and bivariate distributions.
 Regression plots: primarily intended to add a visual guide that helps emphasise patterns in a dataset during exploratory data analysis.

4) Matplotlib: Matplotlib is an amazing visualization library in Python for 2D plots of arrays.


Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work
with the broader SciPy stack. It was introduced by John Hunter in the year 2002. One of the greatest
benefits of visualization is that it allows us visual access to huge amounts of data in easily digestible
visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.

Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make correlations visible; they are instruments for reasoning about quantitative information.
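A small sketch of the plot types mentioned above on one figure (the numbers are arbitrary, and `plots.png` is just a hypothetical output name):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

xs = [1, 2, 3, 4]
ys = [10, 20, 15, 25]

fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(xs, ys)       # line plot
axes[0, 1].bar(xs, ys)        # bar chart
axes[1, 0].scatter(xs, ys)    # scatter plot
axes[1, 1].hist(ys, bins=4)   # histogram
fig.savefig("plots.png")      # hypothetical filename
plt.close(fig)
```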
5) Sklearn: Scikit-learn is mostly written in Python and makes heavy use of the NumPy module for array and linear-algebra computations. To improve performance, some core algorithms are also written in Cython.

Scikit-learn plays well with other Python libraries, such as SciPy, Pandas data frames, NumPy for array vectorization, and Matplotlib, seaborn, and plotly for graphing. Whether you're looking for an ML overview, want to come up to speed quickly, or want the most recent ML tool, scikit-learn is well documented and simple to grasp. With the aid of this high-level toolbox, you can easily create a predictive data-analysis model and fit it to the gathered data. It is flexible and works nicely with other Python libraries.
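A sketch of that fit/predict workflow with scikit-learn's LinearRegression. The LSTAT/PRICE values here are invented, so the fitted coefficients will not match the assignment's OLS output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented stand-ins for LSTAT (feature) and PRICE (target).
X = np.array([[4.0], [6.5], [9.1], [12.4], [15.0]])
y = np.array([34.7, 28.9, 23.1, 19.0, 15.2])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])  # c and m of y = c + m*x
print(model.predict([[10.0]]))           # estimated PRICE at LSTAT = 10
```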
Que.2) Identify the dependent and independent variables. [2 Marks]

i) Dependent Variable: PRICE

ii) Independent Variables:

a. CRIM per capita crime rate by town


b. ZN proportion of residential land zoned for lots over 25,000 sq. ft.
c. INDUS proportion of non-retail business acres per town
d. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
e. NOX nitric oxides concentration (parts per 10 million)
f. RM average number of rooms per dwelling
g. AGE proportion of owner-occupied units built prior to 1940
h. DIS weighted distances to five Boston employment centres
i. RAD index of accessibility to radial highways
j. TAX full-value property-tax rate per 10,000 USD
k. PTRATIO pupil-teacher ratio by town
l. Black 1000(Bk − 0.63)² where Bk is the proportion of Black residents by town
m. LSTAT % lower status of the population.

Que.3) Which variables are strongly related to the price variable? Comment on their values. [4 Marks]

 Price and RM have a strong positive relationship; the correlation value is 0.695.

 Price and LSTAT have a strong negative relationship; the correlation value is −0.737.

Que.4) Observe the above heatmap and comment on your findings. [4 Marks]

 RAD and TAX have a strong positive relationship; the correlation value is 0.91.
 INDUS and TAX have a strong positive relationship; the correlation value is 0.72.
 TAX and NOX have a strong positive relationship; the correlation value is 0.668.
 Price and RM have a strong positive relationship; the correlation value is 0.695.

Que.5) Observe the following output of the regression model and answer the following sub-questions.

i) Write the regression model (y = mx + c). [3 Marks]
ii) How much variation is explained by the fitted model? [2 Marks]
iii) Which method is used to estimate the parameters? [2 Marks]
iv) Estimate the value of 'PRICE' for LSTAT = 0.6456. [3 Marks]

i)

y = c + mx

y = PRICE (dependent variable)

x = LSTAT (independent variable)

m = −0.8809
c = 33.3178

PRICE = 33.3178 − 0.8809 × LSTAT

ii)

R² = 0.550

Adj. R² = 0.545

R² measures the proportion of variation in the dependent variable explained by the model; Adjusted R² corrects this measure for the number of predictors and helps determine goodness of fit.

So, this fitted model explains about 55% of the variation in PRICE (R² = 0.550; Adjusted R² = 0.545).
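The adjusted value follows from the standard formula Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The regression output above does not show the sample size, so the n used below is a purely hypothetical value for illustration:

```python
# Adjusted R² from R² for n observations and p predictors.
# The formula is standard; n = 92 below is hypothetical, since
# the regression output does not show the sample size.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.550, 92, 1))
```

Note that Adjusted R² is always at most R², and the gap grows as more predictors are added relative to the sample size.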

iii) The Ordinary Least Squares (OLS) method is used to estimate the parameters.

For a model with p explanatory variables, the OLS regression model is written:

Y = β0 + Σj=1..p βjXj + ε

where Y is the dependent variable, β0 is the intercept of the model, Xj is the jth explanatory variable (j = 1 to p), and ε is the random error with expectation 0 and variance σ².

iv)

y = c + mx

m = - 0.8809

x(LSTAT) = 0.6456

c = 33.3178

Price(y) = 33.3178 + (−0.8809) × 0.6456

= 33.3178 − 0.5687
= 32.7491
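The plug-in computation above can be checked directly, using the intercept and slope from the OLS output:

```python
# Coefficients taken from the OLS output above.
c = 33.3178       # intercept
m = -0.8809       # slope on LSTAT
lstat = 0.6456

price = c + m * lstat
print(round(price, 4))   # 32.7491
```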

Que 6) Observe the following residual plot and comment on it. [5 Marks]

Are the points scattered randomly around the residual = 0 line, or are they clustered?

In this residual plot there is a pattern that can be described: some data points lie above the residual = 0 line, then a run of points falls below it, and the next points cluster on or above the line again. Most points are clustered in one area. Since there is a detectable pattern in the residual plot, we conclude that a linear model is not the right fit for the data.
