BAA Class Notes
STATISTICS:
Definition: Science of collection, analysis, presentation and reasonable interpretation of
data.
Statistics is the art & science of learning from data:
Data Collection & Preparation
Data Analysis
Data Interpretation for Insights
Statistics is, first and foremost, a collection of tools, techniques and algorithms used for
converting raw data into useful information to help in decision making.
The goal of statistics is to help researchers organize and interpret data, and to support
descriptive, predictive and prescriptive analysis.
"Univariate," "bivariate," and "multivariate" are terms commonly used in statistics and data
analysis to describe the number of variables involved in an analysis. Here's what each term
means:
Univariate:
Definition: Univariate refers to an analysis that involves only one variable. It focuses on the
distribution and properties of that single variable.
Example: If you were analyzing the heights of a group of people, a univariate analysis would
involve examining the distribution of heights without considering any other variables.
Bivariate:
Definition: Bivariate refers to an analysis that involves the relationship between two
variables. It explores how one variable changes in relation to another.
Example: If you were studying the relationship between hours of study and exam scores, a
bivariate analysis would involve looking at how changes in the number of study hours
correspond to changes in exam scores.
Multivariate:
Definition: Multivariate refers to an analysis that involves more than two variables. It
examines the simultaneous relationships among multiple variables.
Example: In a study examining factors influencing job satisfaction, you might consider
variables such as salary, work hours, and job responsibilities simultaneously. Analyzing all
these variables together constitutes a multivariate analysis.
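As an illustrative sketch of the first two levels, a univariate summary and a bivariate correlation can be computed in a few lines (the data values below are made up for illustration):

```python
import numpy as np

# Hypothetical data for five people (illustrative values only)
heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])   # cm
hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])              # study hours
scores = np.array([55.0, 60.0, 70.0, 80.0, 85.0])         # exam scores

# Univariate: examine a single variable (its center and spread)
mean_height = heights.mean()
std_height = heights.std()

# Bivariate: relationship between two variables (Pearson correlation)
r = np.corrcoef(hours, scores)[0, 1]

print(mean_height, round(r, 3))
```

A multivariate analysis would consider all three (and more) variables jointly, for example via multiple regression.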
A Variable is a characteristic or condition that can change or take on different values.
Example - Temperature, Age, Marital status, Height, Weight, IQ, etc.
A Variable can be classified as either
• Categorical (Qualitative), or
• Quantitative, which is further divided into
– Discrete
– Continuous
Usually, populations are so large that a researcher cannot examine the entire group.
Therefore, a sample is selected to represent the population in a research study.
The goal is to use the results obtained from the sample to help answer questions about the
population.
Data Collection
Main methods for data collection:
Experiment: The investigator controls or modifies the environment and observes the effect
on the variable under study.
Census: A 100% survey. Every element of the population is listed. Seldom used: difficult and
time-consuming to compile, and expensive.
Observational study: Like experiments, observational studies attempt to understand cause-
and-effect relationships. However, unlike experiments, the researcher is not able to control
(1) how subjects are assigned to groups and/or (2) which treatments each group receives.
Interquartile Range (IQR):
Definition: The IQR is a measure of statistical dispersion, specifically the range covered by
the middle 50% of the data. It is calculated as the difference between the third quartile (Q3)
and the first quartile (Q1).
Formula:
IQR = Q3 − Q1
Interpretation: A larger IQR indicates a greater spread of the middle 50% of the data. The
IQR is less sensitive to extreme values (outliers) than the range.
Range:
Definition: The range is a measure of the spread of a dataset and represents the difference
between the maximum and minimum values.
Formula:
Range = Maximum Value − Minimum Value
Interpretation: The range provides a simple and intuitive measure of how much the data
values vary. However, it is sensitive to extreme values and may be influenced by outliers.
Comparison:
The IQR focuses on the middle 50% of the data, making it more robust against extreme
values or outliers. It measures the spread of the central portion of the data.
The range considers the entire span of the data, from the minimum to the maximum value.
It is sensitive to outliers and may overstate the typical spread of the data.
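A quick sketch contrasting the two measures on a made-up dataset containing one extreme value:

```python
import numpy as np

# Hypothetical dataset with one extreme value (50) acting as an outlier
data = np.array([4, 5, 6, 7, 8, 9, 10, 50])

# Range: stretched badly by the single outlier
data_range = data.max() - data.min()

# IQR: spread of the middle 50%, barely affected by the outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(data_range, iqr)   # range is 46, while the IQR stays small
```

The one extreme value inflates the range to 46, while the IQR remains a few units, illustrating its robustness.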
DEVIATION :
In statistics, deviation refers to the amount by which a single data point or a set of data
points differs from the mean (average) of the dataset. It provides a measure of how much
individual values vary from the central tendency represented by the mean. The deviation of
a data point (x) from the mean (μ) is calculated as:
Deviation = x − μ
Where:
- x is an individual data point.
- μ is the mean of the dataset.
The deviation can be positive or negative, depending on whether the data point is above or
below the mean.
- **Positive Deviation:** If the data point is greater than the mean (x − μ > 0), the data
point is above the average.
- **Negative Deviation:** If the data point is less than the mean (x − μ < 0), the data point
is below the average.
Understanding the sign of the deviation is crucial because it provides information about the
direction of the difference from the mean. Positive deviations indicate values above the
mean, while negative deviations indicate values below the mean.
In statistical analysis, deviations are often squared (squared deviations) to avoid canceling
out positive and negative values when calculating variances and standard deviations. The
squared deviations are then averaged to obtain the variance, and the square root of the
variance gives the standard deviation.
Deviation is a fundamental concept in statistics and plays a key role in various statistical
measures and analyses. It helps quantify the spread or dispersion of data points from the
central tendency represented by the mean.
Formula for Standard Deviation
Standard deviation gives a measure of variation by summarizing the deviations of each
observation from the mean and calculating an adjusted average of these deviations
(the division by n − 1 rather than n is the "adjustment"):
s = √( Σ(x − x̄)² / (n − 1) )
An observation is a potential outlier if it falls more than 1.5 × IQR below the first quartile or
more than 1.5 × IQR above the third quartile.
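The ideas above (deviations, variance, standard deviation, and the 1.5 × IQR outlier fences) can be sketched on a small made-up dataset:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # made-up values

mu = data.mean()                      # mean of the dataset
deviations = data - mu                # positive above the mean, negative below
variance = (deviations ** 2).mean()   # population variance (mean of squared deviations)
std = np.sqrt(variance)               # standard deviation

# 1.5 * IQR fences for flagging potential outliers
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(mu, variance, std, (lower, upper))
```

Note that the raw deviations always sum to zero, which is exactly why they are squared before averaging.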
Correlational Studies
The goal of a correlational study is to determine whether there is a relationship between
two variables and to describe the relationship.
A correlational study simply observes the two variables as they exist naturally.
Probability Distribution:
Random variable - A variable whose value is subject to variations due to randomness.
Example - coin toss, throw of a die, height of students in a class.
The mathematical function that describes this random behavior (the probabilities of
expected outcomes random variable can take) is known as probability distribution.
An antique mirror manufacturer sets the cost of their mirrors based on the manufacturing
cost, which is Rs 1800. However, the dealers think there are hidden costs and that the
average cost to manufacture the mirrors is actually much higher. The dealers randomly
select 40 mirrors and find that the mean cost to produce a mirror is Rs 1950 with a
standard deviation of Rs 500. Conduct a hypothesis test to see if this thought is true.
Alpha represents an acceptable probability of a Type I error in a statistical test.
H0 : μ ≤ 1800
Ha : μ > 1800
z = (1950 − 1800) / (500/√40) = 1.897
z score > critical value (1.645 at the 0.05 alpha level)
Hence we reject the null hypothesis.
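The calculation in this example can be verified with a short script using only the standard library (the one-tailed p-value is computed from the normal CDF via the error function):

```python
import math

# Figures from the mirror example
mu0, xbar, s, n = 1800, 1950, 500, 40

# Test statistic: (sample mean - hypothesized mean) / standard error
z = (xbar - mu0) / (s / math.sqrt(n))

# One-tailed p-value from the standard normal CDF
p = 0.5 * (1 - math.erf(z / math.sqrt(2)))

reject = z > 1.645   # critical value at alpha = 0.05, one-tailed
print(round(z, 3), round(p, 4), reject)
```

Since z ≈ 1.897 exceeds 1.645 (equivalently, p < 0.05), the null hypothesis is rejected.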
| Statistical Component | Description | Significance |
|---|---|---|
| Z-Test | Determines if there's a significant difference between a sample statistic and a known population parameter, assuming a known population standard deviation. | Indicates whether observed differences are likely due to chance or are statistically significant. |
| Z-Score | Standardizes a data point's distance from the mean in terms of standard deviations. | Helps compare data points from different distributions and assess the relative position of individual observations. |
| P-Value | Quantifies the probability of observing a result as extreme as, or more extreme than, the one obtained, assuming the null hypothesis is true. | A low p-value suggests that observed data is unlikely under the null hypothesis, leading to the rejection of the null hypothesis in favor of the alternative. The smaller the p-value, the stronger the evidence against the null hypothesis. |
Machine learning:
Inference of model/process that generated the data by analyzing the data itself
• Essential in :
– Descriptive analysis
• Rainfall pattern over years
– Predictive analysis
• Which areas are likely to receive rain given wind conditions
– Prescriptive analysis
• What crops should I harvest given the rain conditions
ML Paradigms:
Supervised Learning :
Learn a mapping from input to output
• Classification – categorical output
• Regression – continuous output
Unsupervised Learning :
Discover patterns in data
• Clustering – cohesive grouping
• Association rule mining – frequent co-occurrence and relation
Reinforcement Learning :
Learning control
Explanation table (detailed):

| Learning Type | Definition | Key Characteristics | Examples |
|---|---|---|---|
| Supervised Learning | A type of machine learning where the algorithm is trained on a labeled dataset, with input-output pairs. | The algorithm learns from labeled examples to make predictions on new, unseen data. | Classification: predicting labels (e.g., spam or not spam). Regression: predicting continuous values (e.g., house prices). |
| Unsupervised Learning | A type of machine learning where the algorithm is given unlabeled data and must find patterns or relationships on its own. | The algorithm explores the inherent structure of the data without explicit labels. | Clustering: grouping similar data points. Dimensionality reduction: simplifying data while retaining important features. Association: discovering relationships between variables. |
| Reinforcement Learning | A type of machine learning where an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties. | The algorithm learns a policy that maps states to actions to maximize cumulative rewards. | Game playing (e.g., AlphaGo). Robotics control. Autonomous vehicles. |
Use the PPT to understand more: ma'am has provided a better explanation in the PPT.
LINEAR REGRESSION
Linear regression is a study of the relationship between two or more variables.
Dependent or predicted or response variable is represented as 'y'.
Independent variables or predictors are represented as 'x1, x2, …, xn'.
'y' can be mapped as a function of the independent variables:
y = f(x1, x2, …, xn)
Variables in regression:
Categorical input variables need to be converted to dummy variables.
The dependent variable in linear regression is a quantitative variable.
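As a sketch of dummy-variable conversion (the frame and column names here are invented for illustration), pandas' `get_dummies` is the usual tool:

```python
import pandas as pd

# Hypothetical frame with one categorical predictor
df = pd.DataFrame({
    "Area": [1200, 1500, 1100],
    "City": ["Pune", "Delhi", "Pune"],
})

# drop_first=True avoids the dummy-variable trap (perfect collinearity
# between the dummies and the intercept)
df_enc = pd.get_dummies(df, columns=["City"], drop_first=True)
print(df_enc.columns.tolist())   # ['Area', 'City_Pune']
```

With k categories, `drop_first=True` keeps k − 1 dummy columns; the dropped category becomes the baseline level.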
Coding basics:
SALES DATA EXPLANATION OF CODE:
Let's break down the code and explain the purpose of each section:
1. **Imports:**
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
```
- **Explanation:** Importing the core libraries: NumPy for numerical computation, Pandas
for data handling, Matplotlib and Seaborn for visualization, and the Colab drive module for
mounting Google Drive to read the dataset.
5. **Data Visualization:**
```python
sns.boxplot(x='Sales', data=data)
sns.boxplot(x='TV', data=data)
sns.boxplot(x='Radio', data=data)
sns.boxplot(x='Newspaper', data=data)
```
6. **Outlier Treatment:**
```python
q1 = data["Newspaper"].quantile(0.25)
q3 = data["Newspaper"].quantile(0.75)
IQR = q3 - q1
lower_limit = q1 - (IQR * 1.5)
upper_limit = q3 + (IQR * 1.5)
data.loc[data["Newspaper"] < lower_limit, "Newspaper"] = lower_limit
data.loc[data["Newspaper"] > upper_limit, "Newspaper"] = upper_limit
```
- **Explanation:** Capping 'Newspaper' values that fall outside the 1.5 × IQR fences at the
lower and upper limits; a boxplot can then be drawn again to visualize the impact of the
treatment on the 'Newspaper' column.
- **Explanation:** Filling missing values in the 'TV' and 'Radio' columns with their
respective mean values.
9. **Train-Test Split:**
```python
X = data[["TV", "Radio", "Newspaper"]]
y = data[["Sales"]]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3,
random_state=100)
```
- **Explanation:** Splitting the data into training and testing sets for building and
evaluating the linear regression model.
- **Explanation:** Training a linear regression model on the training data and printing the
coefficients.
- **Explanation:** Evaluating the model using mean squared error and R-squared on both
training and testing sets.
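A minimal sketch of the training and evaluation steps described above, using synthetic data in place of the sales dataset (the coefficients 3 and 2 and intercept 5 are chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the advertising data: y = 5 + 3*x1 + 2*x2 (noise-free)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 5 + 3 * X[:, 0] + 2 * X[:, 1]

# Fit the model and inspect the learned coefficients
model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Evaluate with mean squared error and R-squared
mse = mean_squared_error(y, pred)
r2 = r2_score(y, pred)
print(model.coef_, model.intercept_, mse, r2)
```

Because the synthetic data is noise-free, the model recovers the true coefficients and R² is essentially 1; on real data one would compute these metrics on both the train and test splits.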
RAINFALL PREDICTION:
This Python notebook is focused on rainfall prediction using machine learning.
Let's break down the code and understand the functions used:
2. **Ignoring Warnings:**
```python
import warnings
warnings.filterwarnings('ignore')
```
- This code suppresses warning messages to enhance the clarity of the notebook.
3. **Importing Libraries:**
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
- These statements import essential libraries for data analysis and visualization: NumPy,
Pandas, Matplotlib, and Seaborn.
4. **Reading Data:**
```python
data = pd.read_csv("weather_data.csv")
```
- This line reads a CSV file named "weather_data.csv" into a Pandas DataFrame called
`data`.
5. **Exploring Data:**
```python
print(data.info())
print(data.describe())
```
- These lines provide information about the structure of the dataset, such as column
names, data types, and summary statistics.
6. **Missing Value Treatment:**
```python
# Missing value treatment
# Mode imputation for categorical features
# Mean imputation for numerical features
```
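One way the commented imputation steps might look in code (the column names below are invented for illustration; the real notebook's columns may differ):

```python
import numpy as np
import pandas as pd

# Hypothetical weather-style frame; column names are illustrative only
df = pd.DataFrame({
    "Temperature": [30.0, np.nan, 28.0, 32.0],
    "WindDir": ["N", "S", None, "N"],
})

# Mean imputation for the numerical feature
df["Temperature"] = df["Temperature"].fillna(df["Temperature"].mean())

# Mode imputation for the categorical feature (mode() ignores missing values)
df["WindDir"] = df["WindDir"].fillna(df["WindDir"].mode()[0])

print(df.isna().sum().sum())   # 0
```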
7. **Outlier Treatment:**
```python
# Outlier treatment using IQR method
```
8. **Feature Engineering:**
```python
# Converting categorical variables to dummy variables
```
9. **Data Splitting:**
```python
# Splitting data into training and testing sets
```
Overall, the notebook covers data exploration, preprocessing, and the training and
evaluation of machine learning models for rainfall prediction. The models used include
Logistic Regression, Decision Trees, and Random Forest.
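A hedged sketch of the final modeling step, using Logistic Regression on synthetic features standing in for the weather data (the feature meanings and decision rule are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two features (say, a humidity and a pressure index)
# and a binary rain label; the labeling rule below is arbitrary
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=100)

# Train and score the classifier on the held-out test set
clf = LogisticRegression().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(round(acc, 3))
```

A `DecisionTreeClassifier` (from `sklearn.tree`) or `RandomForestClassifier` (from `sklearn.ensemble`) could be swapped in with the same fit/score interface to reproduce the other models mentioned above.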