Machine Learning: Assignment-1
Housing prices are an important reflection of the economy, and housing price ranges are of great interest for both
buyers and sellers. House prices will be predicted given explanatory variables that cover many aspects of residential
houses. The goal of this case study is to create a regression model that is able to accurately estimate the price of the
house given the features.
Data description.
Below is the detailed analysis performed on the Boston data set. Observe the analysis and answer the questions given below:
Q.1) State the use of the following libraries. [5 Marks]
1) Pandas: Pandas is a Python library used for working with data. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" has a reference to both "Panel Data",
and "Python Data Analysis" and was created by Wes McKinney in 2008.
Use of Pandas:
Pandas allows us to analyse big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or
NULL values. This is called cleaning the data.
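The cleaning step described above can be sketched as follows. This is a minimal illustration on a toy table, not the assignment's actual data: the column names RM and PRICE and all values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); NaN marks a missing entry.
df = pd.DataFrame({
    "RM": [6.5, np.nan, 5.9, 7.1],
    "PRICE": [24.0, 21.6, np.nan, 34.7],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Drop every row that contains at least one missing value.
cleaned = df.dropna()

print(missing_per_column)
print(cleaned)
```

Dropping rows is only one option. Depending on the data, filling the gaps (for example with a column mean) may keep more information.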
2) Numpy: It is a general-purpose package for array processing. It offers a multidimensional array object
with outstanding speed as well as capabilities for interacting with these arrays. It is the cornerstone
Python module for scientific computing. In addition to its apparent scientific applications, Numpy is a
powerful multi-dimensional data container.
What Numpy refers to as an array is a table of elements (often numbers) of the same type, all indexed
by a tuple of non-negative integers. The rank of the array is its number of dimensions. The shape of
the array is a tuple of integers that gives the size of the array along each axis. The array class in
Numpy is called ndarray. Square brackets are used to access elements in Numpy arrays, and nested
Python lists can be used to populate the arrays.
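A short sketch of these terms, using a small array built from a nested Python list. The values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Populate a 2-D array from a nested Python list.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(type(a).__name__)  # class name of the array object
print(a.ndim)            # number of dimensions (the "rank")
print(a.shape)           # size along each axis
print(a[1, 2])           # element at row 1, column 2 (indices start at 0)
```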
3) Seaborn: It is a Python module mostly used for statistical graphing. It is based on Matplotlib and
offers lovely default styles and colour palettes to enhance the aesthetic appeal of statistical displays.
With Seaborn, visualisation will be at the heart of data exploration and comprehension. For a better
comprehension of the dataset, it offers dataset-oriented APIs that allow us to switch between various
visual representations for the same variables.
Plots are primarily used to show how different variables relate to one another. These variables may be
entirely numerical or may represent a category, such as a group, class, or division. Seaborn categorises
the plot into the following groups:
Relational plots: This plot is used to understand the relation between two variables.
Categorical Plot: This plot deals with categorical variables and how they can be visualized.
Distribution plots: This plot is used for examining univariate and bivariate distributions.
Regression plots: The regression plots in seaborn are primarily intended to add a visual guide that
helps to emphasize patterns in a dataset during exploratory data analyses.
4) Matplotlib: Matplotlib comes with a wide variety of plots. Plots help to understand trends and
patterns and to identify correlations. They are typically instruments for reasoning about quantitative information.
5) Sklearn: Scikit-learn is mostly written in Python and significantly makes use of the NumPy module
for computations involving arrays and linear algebra. To further the effectiveness of this library, some
basic algorithms are also written in Cython.
Scikit-learn works well with other Python libraries, including SciPy, Pandas (data frames), NumPy
(array vectorization), and Matplotlib, seaborn, and plotly (graphing).
Whether you are looking for an overview of ML, want to get up to speed quickly, or want an
up-to-date ML tool, scikit-learn is well documented and simple to grasp. With this high-level
toolbox you can easily create a predictive data analysis model and fit it to the gathered data.
It is flexible and integrates well with other Python libraries.
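As a rough illustration of creating a model and fitting it to data, the sketch below fits scikit-learn's LinearRegression to synthetic points. The data are an assumption made up for the example (an exact line y = 3 - 2x), not the Boston data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not the Boston data): y = 3 - 2x exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 - 2.0 * X.ravel()

# Create the model and fit it to the data.
model = LinearRegression()
model.fit(X, y)

intercept = model.intercept_
slope = model.coef_[0]

# Predict the target for a new, unseen x value.
prediction = model.predict(np.array([[4.0]]))[0]
print(intercept, slope, prediction)
```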
Que.2) Identify the dependent and independent variable. [2 Mark]
Que.3) Which variables are strongly related to the price variable? Comment on their values. [4
Marks]
RAD and TAX have a strong positive correlation; the value is 0.91.
INDUS and TAX have a strong positive correlation; the value is 0.72.
TAX and NOX have a strong positive correlation; the value is 0.668.
PRICE and RM have a strong positive correlation; the value is 0.695.
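Correlation values like these are typically read from a correlation matrix. The sketch below shows one way to rank variables by their correlation with the price column. The table is a toy example with made-up values, so the numbers it prints are not those of the Boston data.

```python
import pandas as pd

# Toy values (hypothetical, not taken from the Boston data).
df = pd.DataFrame({
    "RM": [5.0, 6.0, 6.5, 7.0, 8.0],
    "LSTAT": [20.0, 12.0, 10.0, 6.0, 3.0],
    "PRICE": [15.0, 21.0, 24.0, 30.0, 45.0],
})

# Pearson correlation of every other column with PRICE.
corr_with_price = df.corr()["PRICE"].drop("PRICE")

# Rank by absolute value, so strong negative relations are not overlooked.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

A value close to +1 or -1 indicates a strong linear relation, and the sign gives its direction.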
[4 marks]
Que.5) Observe the following output of regression variable and answer the following sub-questions. [3 Mark]
i)
y = c + mx
y = Price (dependent variable)
x = LSTAT (independent variable)
m = -0.8809 (slope)
c = 33.3178 (intercept)
ii)
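R² = 0.550
Adj. R² = 0.545
R² is the proportion of the variation in the dependent variable that the model explains. Adjusted R² corrects
this value for the number of explanatory variables, which makes it the better measure of goodness of fit for a
multiple regression model.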
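The standard relation between the two quantities is Adj. R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of explanatory variables. The sketch below evaluates it. The sample size n = 100 is a hypothetical value, since the regression output does not state it here.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# n = 100 is a hypothetical sample size used only for illustration.
value = adjusted_r2(r2=0.550, n=100, p=1)
print(round(value, 4))
```

The adjusted value is always slightly below R² when p is at least 1, and the gap grows as more variables are added without improving the fit.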
iii) The Ordinary Least Squares (OLS) method is used to estimate the parameters.
In the case of a model with p explanatory variables, the OLS regression model is written as:
Y = β0 + Σ(j=1..p) βj Xj + ε
where Y is the dependent variable, β0 is the intercept of the model, Xj is the jth explanatory
variable of the model (j = 1 to p), and ε is the random error with expectation 0 and variance σ².
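A minimal sketch of OLS estimation with NumPy, assuming synthetic noise-free data generated from known coefficients (β0 = 1, β1 = 2, β2 = -3). Because the data contain no error term, the estimates should recover these values.

```python
import numpy as np

# Synthetic data: Y = 1 + 2*X1 - 3*X2, with no noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Add a column of ones so that beta[0] plays the role of the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimises the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
print(beta)
```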
iv)
y = c + mx
m = -0.8809
x (LSTAT) = 0.6456
c = 33.3178
y = 33.3178 + (-0.8809 × 0.6456)
= 33.3178 - 0.5687
= 32.7491
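The arithmetic can be checked with a few lines of Python, using only the values given in the answer.

```python
c = 33.3178   # intercept from the regression output
m = -0.8809   # slope (coefficient of LSTAT)
x = 0.6456    # LSTAT value given in the question

# Predicted price from the fitted line y = c + m*x.
y = c + m * x
print(round(y, 4))
```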
Que 6) Observe the following residual plot and comment on it. [5 Mark]
The residual plot shows a describable pattern. The first group of data points lies above the residual = 0 line,
the next group lies below it, and the following points are again clustered on or above it. Most points are
clustered in one area. Because the residuals follow a detectable pattern instead of scattering randomly around
zero, we conclude that a linear model is not a good fit for the data.
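The above, below, above pattern is what a straight-line fit produces on curved data. The sketch below reproduces it on synthetic data, assuming a quadratic relation y = x² purely for illustration. It is not the data behind the plot in the question.

```python
import numpy as np

# Synthetic curved data (an assumption for illustration): y = x^2.
x = np.linspace(-3.0, 3.0, 31)
y = x ** 2

# Fit a straight line y = c + m*x by least squares.
m, c = np.polyfit(x, y, deg=1)
residuals = y - (c + m * x)

# Sign of the residuals at the left end, the middle, and the right end.
print(np.sign(residuals[[0, 15, -1]]))
```

Residuals that change sign in blocks like this suggest adding a non-linear term or transforming a variable, whereas a well-specified linear model leaves residuals scattered randomly around zero.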