
BAA Class Notes:

STATISTICS:
Definition: Science of collection, analysis, presentation and reasonable interpretation of
data.
Statistics is the art & science of learning from data:
Data Collection & Preparation
Data Analysis
Data Interpretation for Insights

Statistics is, first and foremost, a collection of tools, techniques and algorithms used for
converting raw data into useful information to help in decision making.

The goal of statistics is to help researchers organize and interpret data, and to support
descriptive, predictive, and prescriptive analysis.

Two areas of statistics:


Descriptive Statistics: collection, presentation, analysis, and description of sample data.
Inferential Statistics: making decisions and drawing generalized conclusions about a
population (the larger group of interest) based on sample data.

"Univariate," "bivariate," and "multivariate" are terms commonly used in statistics and data
analysis to describe the number of variables involved in an analysis. Here's what each term
means:

Univariate:

Definition: Univariate refers to an analysis that involves only one variable. It focuses on the
distribution and properties of that single variable.
Example: If you were analyzing the heights of a group of people, a univariate analysis would
involve examining the distribution of heights without considering any other variables.
Bivariate:

Definition: Bivariate refers to an analysis that involves the relationship between two
variables. It explores how one variable changes in relation to another.
Example: If you were studying the relationship between hours of study and exam scores, a
bivariate analysis would involve looking at how changes in the number of study hours
correspond to changes in exam scores.
Multivariate:

Definition: Multivariate refers to an analysis that involves more than two variables. It
examines the simultaneous relationships among multiple variables.
Example: In a study examining factors influencing job satisfaction, you might consider
variables such as salary, work hours, and job responsibilities simultaneously. Analyzing all
these variables together constitutes a multivariate analysis.
A Variable is a characteristic or condition that can change or take on different values.
Examples: temperature, age, marital status, height, weight, IQ, etc.
A variable can be classified as either:
Categorical (Qualitative)
Quantitative, which is further divided into:
Discrete
Continuous

What are Population & Sample:


The entire group of individuals under study is called the population.
For example, a researcher may be interested in the relation between class size (variable 1)
and academic performance (variable 2) for the population of third-grade children.

Usually, populations are so large that a researcher cannot examine the entire group.
Therefore, a sample is selected to represent the population in a research study.

The goal is to use the results obtained from the sample to help answer questions about the
population.

FOR VARIABLE TYPES, REFER TO THE PPT.

Data Collection
Main methods for data collection:
Experiment: The investigator controls or modifies the environment and observes the effect
on the variable under study.
Census: A 100% survey. Every element of the population is listed. Seldom used: difficult and
time-consuming to compile, and expensive.
Observational study: Like experiments, observational studies attempt to understand cause-
and-effect relationships. However, unlike experiments, the researcher is not able to control
(1) how subjects are assigned to groups and/or (2) which treatments each group receives.

MEAN / MEDIAN / MODE :


MEAN : The mean is the sum of the observations divided by the number of observations

MEDIAN (MIDPOINT) : The midpoint of the observations when they are ordered from least to greatest.

Order the observations. If the number of observations is:
Odd, the median is the middle observation.
Even, the median is the average of the two middle observations.
Skewness & Kurtosis by distribution shape:

| Condition | Skewness | Kurtosis | Interpretation |
|---|---|---|---|
| Normal Distribution | Close to 0 | Close to 3 (Mesokurtic) | Data is symmetric with a bell-shaped curve. |
| Left-Skewed Distribution | Negative | Less than 3 (Platykurtic) | Tail is longer on the left side. Distribution is flatter than normal. |
| Right-Skewed Distribution | Positive | Greater than 3 (Leptokurtic) | Tail is longer on the right side. Distribution has heavier tails than normal. |
| Symmetrical Distribution (Non-Normal) | Can be 0, but not necessarily | Can vary (mesokurtic or non-mesokurtic) | Distribution is symmetric but not necessarily normal. Kurtosis can differ from 3. |
| Uniform Distribution | 0 | Less than 3 (Platykurtic) | Data is evenly spread across the range. Tails are lighter than normal. |
| Bimodal or Multimodal Distribution | Depends on the shape of individual modes | Depends on the shape of individual modes | Skewness and kurtosis vary based on the characteristics of each mode. |

MODE : The value that occurs most often (the highest bar in the histogram).
Mode is most often used with categorical data.
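
A quick, hypothetical example using Python's built-in statistics module (the data values below are made up, not from the PPT):

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 8]  # made-up sample

print(statistics.mean(data))    # 5.5: sum of observations / number of observations
print(statistics.median(data))  # 5.5: average of the two middle values (n is even)
print(statistics.mode(data))    # 8: the value that occurs most often
```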

IQR and Range :


The Interquartile Range (IQR) and the Range are both measures of the spread or dispersion
of a dataset, but they focus on different aspects of the data distribution.

Interquartile Range (IQR):

Definition: The IQR is a measure of statistical dispersion, specifically the range covered by
the middle 50% of the data. It is calculated as the difference between the third quartile (Q3)
and the first quartile (Q1).
Formula:
IQR = Q3 − Q1
Interpretation: A larger IQR indicates a greater spread of the middle 50% of the data. The
IQR is less sensitive to extreme values (outliers) than the range.
Range:

Definition: The range is a measure of the spread of a dataset and represents the difference
between the maximum and minimum values.
Formula:
Range = Maximum Value − Minimum Value
Interpretation: The range provides a simple and intuitive measure of how much the data
values vary. However, it is sensitive to extreme values and may be influenced by outliers.

Comparison:
The IQR focuses on the middle 50% of the data, making it more robust against extreme
values or outliers; it describes the spread of the central portion of the data.
The range considers the entire span of the data, from the minimum to the maximum value.
It is sensitive to outliers and may not accurately represent the typical spread of the data.
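
A small illustration with NumPy on made-up data, showing how a single extreme value inflates the range but barely moves the IQR:

```python
import numpy as np

data = np.array([10, 12, 14, 15, 18, 20, 22, 25, 60])  # 60 is an extreme value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                          # spread of the middle 50% of the data
data_range = data.max() - data.min()   # stretched by the extreme value 60

print("IQR:", iqr)        # 8.0
print("Range:", data_range)  # 50
```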

DEVIATION :

In statistics, deviation refers to the amount by which a single data point or a set of data
points differs from the mean (average) of the dataset. It provides a measure of how much
individual values vary from the central tendency represented by the mean. The deviation of
a data point (x) from the mean (μ) is calculated as:

Deviation = x − μ

Where:
- x is an individual data point.
- μ is the mean of the dataset.

The deviation can be positive or negative, depending on whether the data point is above or
below the mean.

- **Positive Deviation:** If the data point is greater than the mean (x − μ > 0), the data
point is above the average.

- **Negative Deviation:** If the data point is less than the mean (x − μ < 0), the data
point is below the average.

Understanding the sign of the deviation is crucial because it provides information about the
direction of the difference from the mean. Positive deviations indicate values above the
mean, while negative deviations indicate values below the mean.

In statistical analysis, deviations are often squared (squared deviations) to avoid canceling
out positive and negative values when calculating variances and standard deviations. The
squared deviations are then averaged to obtain the variance, and the square root of the
variance gives the standard deviation.
Deviation is a fundamental concept in statistics and plays a key role in various statistical
measures and analyses. It helps quantify the spread or dispersion of data points from the
central tendency represented by the mean.
Formula for Standard Deviation
Standard deviation gives a measure of variation by summarizing the deviations of each
observation from the mean and calculating an adjusted average of these deviations:

s = √( Σ(xᵢ − x̄)² / (n − 1) )
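
A worked sketch of this formula in plain Python (made-up data); the result matches statistics.stdev, which uses the same n − 1 divisor:

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
mean = sum(data) / n  # 5.0

# Sum of squared deviations from the mean
squared_devs = sum((x - mean) ** 2 for x in data)  # 32.0

# Sample standard deviation: divide by (n - 1), then take the square root
s = math.sqrt(squared_devs / (n - 1))
print(s)                       # ≈ 2.138
print(statistics.stdev(data))  # same result
```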

Empirical Rule: Magnitude of s
For bell-shaped (approximately normal) distributions, about 68% of observations fall within
1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7%
within 3 standard deviations.


Criteria for Identifying an Outlier

An observation is a potential outlier if it falls more than 1.5 × IQR below the first
quartile (Q1) or more than 1.5 × IQR above the third quartile (Q3).
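
Applying this rule in NumPy (a sketch on made-up data):

```python
import numpy as np

data = np.array([10, 12, 14, 15, 18, 20, 22, 25, 60])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # 14 - 12 = 2
upper_fence = q3 + 1.5 * iqr   # 22 + 12 = 34

# Values outside the fences are flagged as potential outliers
print(data[(data < lower_fence) | (data > upper_fence)])  # [60]
```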

Correlational Studies
The goal of a correlational study is to determine whether there is a relationship between
two variables and to describe the relationship.

A correlational study simply observes the two variables as they exist naturally.

Machine Learning Fundamentals:

Single-Sample vs Two-Sample Tests:


| Characteristic | Single-Sample Test | Two-Sample Test |
|---|---|---|
| Purpose | Compare a sample to a known or hypothesized population parameter (mean, proportion, etc.). | Compare two independent samples to assess if they come from populations with different parameters. |
| Example | One-sample t-test, one-sample proportion test. | Independent samples t-test, two-sample proportion test. |
| Hypotheses | Typically involves comparing the sample mean or proportion to a known or hypothesized population parameter. | Involves comparing the means, proportions, or variances of two independent samples. |
| Formulation of Hypotheses | H0: μ = μ0 or H0: p = p0 | H0: μ1 = μ2 or H0: p1 = p2 |
| Test Statistic | Often uses the t-statistic for means or z-statistic for proportions. | Uses the t-statistic for means or z-statistic for proportions, comparing the difference between sample means or proportions. |
| Degrees of Freedom | Typically involves n − 1 degrees of freedom for a t-test, where n is the sample size. | Involves degrees of freedom that depend on the sample sizes of both groups. |
| Assumption | Assumes the sample is representative of the population and the data is normally distributed. | Assumes the samples are independent and may involve assumptions about normality and equal variances. |
| Example Use Case | Testing if the average score of a sample of students is different from a known population mean. | Testing if there is a significant difference in the average scores of two independent groups of students. |
| Output | Provides a p-value indicating the probability of observing the sample result if the null hypothesis is true. | Provides a p-value indicating the probability of observing the difference between the two samples if the null hypothesis is true. |
| Extensions | Paired sample t-test for dependent samples. | Paired sample t-test for dependent samples, analysis of variance (ANOVA) for comparing multiple groups. |
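
A hypothetical illustration of both tests using scipy.stats (the simulated data and parameters are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Single-sample t-test: is the sample mean different from a hypothesized mean of 50?
sample = rng.normal(loc=52, scale=10, size=40)
print(stats.ttest_1samp(sample, popmean=50))

# Two-sample (independent samples) t-test: do the two groups differ in mean?
group_a = rng.normal(loc=50, scale=10, size=40)
group_b = rng.normal(loc=55, scale=10, size=40)
print(stats.ttest_ind(group_a, group_b))
```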

Central Limit Theorem (CLT):


The central limit theorem states that if you have a population with mean μ and standard
deviation σ, and you take a sufficiently large number of random samples from the population
with replacement, then the distribution of the sample means will be approximately
normally distributed.
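
A quick simulation sketch (parameters chosen arbitrarily) showing the theorem with a clearly non-normal exponential population:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, not normal

# Draw many random samples with replacement and record each sample mean
sample_means = [rng.choice(population, size=50, replace=True).mean()
                for _ in range(2000)]

# The sample means cluster around the population mean,
# with standard deviation close to sigma / sqrt(n)
print(np.mean(sample_means), population.mean())
print(np.std(sample_means), population.std() / np.sqrt(50))
```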

Probability Distribution:
Random variable – a variable whose value is subject to variation due to randomness.
Examples: a coin toss, the throw of a die, the height of students in a class.
The mathematical function that describes this random behavior (the probabilities of the
outcomes the random variable can take) is known as a probability distribution.
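
For instance, the number of heads in 10 fair coin tosses is a random variable; its probability distribution is Binomial(n=10, p=0.5). A short sketch with scipy.stats:

```python
from scipy import stats

X = stats.binom(n=10, p=0.5)  # distribution of heads in 10 fair tosses

for k in range(11):
    print(k, round(X.pmf(k), 4))  # probability of exactly k heads
```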

TYPE 1 Error and Type 2 Error :


| Error Type | Definition | Symbolism | Explanation |
|---|---|---|---|
| Type I Error | Incorrectly rejecting a true null hypothesis. | α (Alpha) | The probability of observing a significant result when there is no actual effect or difference. It represents a false positive. |
| Type II Error | Incorrectly failing to reject a false null hypothesis. | β (Beta) | The probability of not observing a significant result when there is an actual effect or difference. It represents a false negative. |

Notes:
- α (Alpha): the level of significance.
- β (Beta): the probability of a Type II error.
- In a Type I error, you conclude there is an effect or difference when there isn't (false positive).
- In a Type II error, you fail to detect an actual effect or difference (false negative).
- The balance between Type I and Type II errors is often controlled by choosing the significance level (α). Lowering α increases the risk of a Type II error.

EXAMPLE TO SOLVE HYPOTHESIS QUESTIONS :

An antique mirror manufacturer sets the price of their mirrors based on the manufacturing
cost, which is Rs 1800. However, the dealers think there are hidden costs and that the
average cost to manufacture the mirrors is actually much higher. The dealers randomly
select 40 mirrors and find that the mean cost to produce a mirror is Rs 1950 with a
standard deviation of Rs 500. Conduct a hypothesis test to see if this claim is true.
Alpha represents an acceptable probability of a Type I error in a statistical test.
H0 : μ ≤ 1800
Halt : μ > 1800
z = (1950 − 1800) / (500/√40) = 1.897
z score > critical value (1.645 for a 0.05 alpha level)
Hence we reject the null hypothesis.
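
The same calculation in Python (a sketch; scipy.stats supplies the one-tailed p-value and critical value from the standard normal distribution):

```python
import math
from scipy import stats

sample_mean, mu0, s, n = 1950, 1800, 500, 40

# Test statistic: z = (x̄ − μ0) / (s / √n)
z = (sample_mean - mu0) / (s / math.sqrt(n))
print(round(z, 3))  # ≈ 1.897

p_value = 1 - stats.norm.cdf(z)   # one-tailed p-value ≈ 0.029
critical = stats.norm.ppf(0.95)   # ≈ 1.645 for alpha = 0.05
print(round(p_value, 3), round(critical, 3))  # z > 1.645 → reject H0
```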

Z-Test, Z-Score, and P-Value:

| Statistical Component | Description | Significance |
|---|---|---|
| Z-Test | Determines if there's a significant difference between a sample statistic and a known population parameter, assuming a known population standard deviation. | Indicates whether observed differences are likely due to chance or are statistically significant. |
| Z-Score | Standardizes a data point's distance from the mean in terms of standard deviations. | Helps compare data points from different distributions and assess the relative position of individual observations. |
| P-Value | Quantifies the probability of observing a result as extreme as, or more extreme than, the one obtained, assuming the null hypothesis is true. | A low p-value suggests that the observed data is unlikely under the null hypothesis, leading to the rejection of the null hypothesis in favor of the alternative. The smaller the p-value, the stronger the evidence against the null hypothesis. |

Shapiro-Wilk test:


The Shapiro-Wilk test is a statistical test used to check if a random variable follows a
normal distribution.
The null hypothesis (H0) states that the variable is normally distributed, and the
alternative hypothesis (H1) states that the variable is NOT normally distributed. So after
running this test:
● If p ≤ 0.05: then the null hypothesis can be rejected (i.e. the decision is that the
variable is NOT normally distributed).
● If p > 0.05: then the null hypothesis cannot be rejected (i.e. the decision is that the
variable MAY BE normally distributed).
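
A minimal sketch of running the test with scipy.stats (the sample here is simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)  # simulated sample; replace with your variable

stat, p = stats.shapiro(x)
if p <= 0.05:
    print("Reject H0: the variable is NOT normally distributed")
else:
    print("Fail to reject H0: the variable MAY BE normally distributed")
```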

Durbin-Watson test:


The Durbin-Watson (DW) statistic is a test for autocorrelation in the residuals (error
terms) from a statistical model or regression analysis. The Durbin-Watson statistic always
has a value between 0 and 4. A value of 2.0 indicates that no autocorrelation is detected
in the sample. Values from 0 to less than 2 point to positive autocorrelation, and values
from 2 to 4 indicate negative autocorrelation.
Autocorrelation – a mathematical representation of the degree of similarity between a given
time series and a lagged version of itself over successive time intervals. Example: if it
rained yesterday, it is likely to rain today.
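
A minimal sketch using statsmodels on simulated residuals (in practice you would pass the residuals of your fitted regression model):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
residuals = rng.normal(size=100)  # simulated residuals; independent by construction

dw = durbin_watson(residuals)
print(dw)  # ≈ 2 → no autocorrelation; < 2 → positive; > 2 → negative
```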

Machine learning:
Inference of model/process that generated the data by analyzing the data itself
• Essential in :
– Descriptive analysis
• Rainfall pattern over years
– Predictive analysis
• Which areas are likely to receive rain given wind conditions
– Prescriptive analysis
• What crops should I harvest given the rain conditions
ML Paradigms:
Supervised Learning :
Learn a mapping from input to output
• Classification – categorical output
• Regression – continuous output
Unsupervised Learning :
Discover patterns in data
• Clustering – cohesive grouping
• Association rule mining – frequent co-occurrence and relations
Reinforcement Learning :
Learning control
Detailed explanation table:

| Learning Type | Definition | Key Characteristics | Examples |
|---|---|---|---|
| Supervised Learning | A type of machine learning where the algorithm is trained on a labeled dataset, with input-output pairs. | The algorithm learns from labeled examples to make predictions on new, unseen data. | Classification: predicting labels (e.g., spam or not spam). Regression: predicting continuous values (e.g., house prices). |
| Unsupervised Learning | A type of machine learning where the algorithm is given unlabeled data and must find patterns or relationships on its own. | The algorithm explores the inherent structure of the data without explicit labels. | Clustering: grouping similar data points. Dimensionality reduction: simplifying data while retaining important features. Association: discovering relationships between variables. |
| Reinforcement Learning | A type of machine learning where an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties. | The algorithm learns a policy that maps states to actions to maximize cumulative rewards. | Game playing (e.g., AlphaGo). Robotics control. Autonomous vehicles. |

USE THE PPT to understand more: ma'am has provided a better explanation in the PPT.

LINEAR REGRESSION
Linear regression is a study of the relationship between two or more variables.
The dependent (predicted, or response) variable is represented as ‘y’.
The independent variables (predictors) are represented as ‘x1, x2, …, xn’.
‘y’ can be mapped as a function of the independent variables:
y = f(x1, x2, …, xn)

Variables in regression:
Categorical input variables need to be converted to dummy variables.
The dependent variable in linear regression is a quantitative variable.

Examples:
How salaries of employees are affected by years of experience, past performance,
education, previous salary, etc.
How customer spend is affected by the number of advertisements, sales calls made, number of
competitors, etc.
How speed and braking affect car fuel efficiency.

The PPTs of the respective topics will provide better clarity:

Coding basics:
SALES DATA EXPLANATION OF CODE :
Let's break down the code and explain the purpose of each section:

1. **Imports:**
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
```

- **Explanation:** Importing the necessary libraries for data manipulation, visualization,
and connecting to Google Drive in a Google Colab environment.

2. **Mount Google Drive:**


```python
drive.mount('/content/drive')
```
- **Explanation:** Mounting Google Drive to access files stored in Google Drive.

3. **Loading and Displaying Data:**


```python
data = pd.read_csv("advertising.csv")  # read_csv already returns a DataFrame
data.head()
data.shape
```
- **Explanation:** Loading a dataset named "advertising.csv" into a Pandas DataFrame.
Displaying the first few rows and checking the dataset's shape.

4. **Exploratory Data Analysis (EDA):**


```python
data.info()
data.describe()
data.isnull().sum()
```

- **Explanation:** Providing information about the dataset, including data types and non-
null counts. Displaying summary statistics and checking for missing values.

5. **Data Visualization:**
```python
sns.boxplot(x='Sales', data=data)
sns.boxplot(x='TV', data=data)
sns.boxplot(x='Radio', data=data)
sns.boxplot(x='Newspaper', data=data)
```

- **Explanation:** Creating boxplots to visualize the distribution and identify potential
outliers in the 'Sales', 'TV', 'Radio', and 'Newspaper' columns.

6. **Outlier Treatment:**
```python
q1 = data["Newspaper"].quantile(0.25)
q3 = data["Newspaper"].quantile(0.75)
IQR = q3 - q1
lower_limit = q1 - (IQR * 1.5)
upper_limit = q3 + (IQR * 1.5)
data.loc[data["Newspaper"] < lower_limit, "Newspaper"] = lower_limit
data.loc[data["Newspaper"] > upper_limit, "Newspaper"] = upper_limit
```

- **Explanation:** Applying outlier treatment to the 'Newspaper' column by capping
values below the lower limit and above the upper limit.

7. **More Data Visualization:**


```python
sns.boxplot(x='Newspaper', data=data)
```

- **Explanation:** Creating a boxplot again after outlier treatment to visualize the impact
on the 'Newspaper' column.

8. **Handling Missing Values:**


```python
for feature in ["TV", "Radio"]:
mean_value = data[feature].mean()
data[feature].fillna(mean_value, inplace=True)
```

- **Explanation:** Filling missing values in the 'TV' and 'Radio' columns with their
respective mean values.

9. **Train-Test Split:**
```python
X = data[["TV", "Radio", "Newspaper"]]
y = data[["Sales"]]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100)
```

- **Explanation:** Splitting the data into training and testing sets for building and
evaluating the linear regression model.

10. **Linear Regression Model:**


```python
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)
print("Coefficients: \n", model.coef_)
```

- **Explanation:** Training a linear regression model on the training data and printing the
coefficients.

11. **Model Evaluation:**


```python
from sklearn.metrics import mean_squared_error, r2_score
preds = model.predict(X_test)  # predictions on the test set
mean_squared_error(y_train, model.predict(X_train))
r2_score(y_train, model.predict(X_train))
mean_squared_error(y_test, preds)
r2_score(y_test, preds)
```

- **Explanation:** Evaluating the model using mean squared error and R-squared on both
training and testing sets.
RAINFALL PREDICTION :
This Python notebook focuses on rainfall prediction using machine learning.
Let's break down the code and understand the functions used:

1. **Mounting Google Drive:**


```python
from google.colab import drive
drive.mount('/content/drive')
```
- This code mounts the Google Drive to the Colab environment. It's a common step in
Google Colab to access files stored in Google Drive.

2. **Ignoring Warnings:**
```python
import warnings
warnings.filterwarnings('ignore')
```
- This code suppresses warning messages to enhance the clarity of the notebook.

3. **Importing Libraries:**
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
- These statements import essential libraries for data analysis and visualization: NumPy,
Pandas, Matplotlib, and Seaborn.

4. **Reading Data:**
```python
data = pd.read_csv("weather_data.csv")
```
- This line reads a CSV file named "weather_data.csv" into a Pandas DataFrame called
`data`.

5. **Exploring Data:**
```python
print(data.info())
print(data.describe())
```
- These lines provide information about the structure of the dataset, such as column
names, data types, and summary statistics.

6. **Handling Missing Values:**


```python
data.drop('Date', axis=1, inplace=True)
```
- This removes the 'Date' column from the dataset.

```python
# Missing value treatment
# Mode imputation for categorical features
# Mean imputation for numerical features
```
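
The comments above only outline the treatment. A minimal sketch of what it might look like, assuming hypothetical column names 'WindDir' (categorical) and 'Humidity' (numerical):

```python
# Mode imputation for a categorical feature ('WindDir' is a hypothetical column name)
data['WindDir'].fillna(data['WindDir'].mode()[0], inplace=True)

# Mean imputation for a numerical feature ('Humidity' is a hypothetical column name)
data['Humidity'].fillna(data['Humidity'].mean(), inplace=True)
```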

7. **Outlier Treatment:**
```python
# Outlier treatment using IQR method
```

8. **Feature Engineering:**
```python
# Converting categorical variables to dummy variables
```
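
A common way to do this with Pandas (a sketch, not the exact notebook code):

```python
# Convert categorical variables to 0/1 dummy variables;
# drop_first=True avoids the dummy-variable trap in regression models
data = pd.get_dummies(data, drop_first=True)
```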

9. **Data Splitting:**
```python
# Splitting data into training and testing sets
```

10. **Logistic Regression Model:**


```python
# Training a Logistic Regression model
```
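
One plausible version of this step (a sketch; the X_train/y_train/X_test names from the earlier split are assumptions):

```python
# Sketch: training a Logistic Regression model
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression(max_iter=1000)
log_model.fit(X_train, y_train)
y_pred = log_model.predict(X_test)              # class predictions
y_prob = log_model.predict_proba(X_test)[:, 1]  # probabilities for the ROC curve
```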

11. **Model Evaluation:**


```python
# ROC curve and confusion matrix for Logistic Regression
```
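
A hedged sketch of the evaluation, using the y_pred and y_prob names assumed above:

```python
# Sketch: confusion matrix and ROC curve for the logistic model
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
import matplotlib.pyplot as plt

print(confusion_matrix(y_test, y_pred))

fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_prob):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```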

12. **Decision Tree Model:**


```python
# Training a Decision Tree model
```

13. **Model Evaluation (Decision Tree):**


```python
# ROC curve and confusion matrix for Decision Tree
```

14. **Random Forest Model:**


```python
# Training a Random Forest model
```

15. **Model Evaluation (Random Forest):**


```python
# ROC curve and confusion matrix for Random Forest
```
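
Steps 12–15 mirror the logistic regression workflow with different estimators. A hedged sketch (the hyperparameters here are illustrative, not from the notebook):

```python
# Sketch: training the tree-based models from steps 12 and 14
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

dt_model = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Each model's ROC curve and confusion matrix can be produced exactly as in the
# logistic regression evaluation, using each model's predictions
print("Decision tree accuracy:", dt_model.score(X_test, y_test))
print("Random forest accuracy:", rf_model.score(X_test, y_test))
```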

Overall, the notebook covers data exploration, preprocessing, and the training and
evaluation of machine learning models for rainfall prediction. The models used include
Logistic Regression, Decision Trees, and Random Forest.
