2a EDA

Uploaded by

GOVINDARAJ

2a.

Exploratory Data Analysis

Instructions:
Please share your answers filled in-line in the word document. Submit code
separately wherever applicable.

Please ensure you update all the details:

Name: _____________ Batch ID: ___________
Topic: Exploratory Data Analysis

Guidelines:
1. An assignment submission is considered complete only when the correct and executable code(s) is
submitted along with the documentation explaining the method and results. Failing to submit either
of those will be considered an invalid submission and will not be considered a correct submission.

2. Ensure that you submit your assignments correctly. Resubmission is not allowed.

3. After submission, you can evaluate your work by referring to the keys provided (available only
after the submission).

Hints: Follow the CRISP-ML(Q) methodology steps where appropriate.

1. Data Understanding: work on each feature of the dataset to create a data
dictionary as displayed in the image below:

Make a table as shown above and provide information about each feature, such as its data
type and its relevance to model building. If a feature is not relevant, provide reasons and a
description of the feature.

Problem Statements:

© 360DigiTMG. All Rights Reserved.

Q1) Calculate Mean, and Standard Deviation using Python code & draw inferences on the
following data. Refer to the Datasets attachment for the data file.
Hint: [Insights drawn from the data such as data is normally distributed/not, outliers, measures
like mean, median, mode, variance, std. deviation]
a. Car’s speed and distance

import pandas as pd

data = pd.read_csv(r"D:\DATA SET\Q1_a.csv")  # raw string keeps the backslashes literal

print("First few rows of the dataset:")
print(data.head())

mean_speed = data['speed'].mean()
std_dev_speed = data['speed'].std()

mean_distance = data['dist'].mean()
std_dev_distance = data['dist'].std()

print("\nStatistics for 'speed':")

print(f"Mean: {mean_speed}")
print(f"Standard Deviation: {std_dev_speed}")

print("\nStatistics for 'distance':")

print(f"Mean: {mean_distance}")
print(f"Standard Deviation: {std_dev_distance}")

b. Top Speed (SP) and Weight (WT)

import pandas as pd

data = pd.read_csv(r"D:\DATA SET\Q1_b.csv")  # raw string keeps the backslashes literal

print("First few rows of the dataset:")

print(data.head())

mean_sp = data['SP'].mean()
std_dev_sp = data['SP'].std()

mean_wt = data['WT'].mean()
std_dev_wt = data['WT'].std()

print("\nStatistics for 'SP' (Top Speed):")

print(f"Mean: {mean_sp}")
print(f"Standard Deviation: {std_dev_sp}")

print("\nStatistics for 'WT' (Weight):")

print(f"Mean: {mean_wt}")
print(f"Standard Deviation: {std_dev_wt}")

Q2) Below are the scores obtained by a student on tests.

34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56
1) Find the mean, median and mode, variance, and standard deviation.
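
import numpy as np
import statistics

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

mean = np.mean(scores)

median = np.median(scores)

# statistics.mode returns the most frequent value as a plain number
mode = statistics.mode(scores)

# np.var and np.std use the population formulas by default (ddof=0);
# pass ddof=1 for the sample versions
variance = np.var(scores)
std_dev = np.std(scores)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_dev}")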

2) What can we say about the student marks?

import numpy as np
from scipy import stats

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

std_dev = np.std(scores)

print(f"Standard Deviation: {std_dev}")
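
One way to support a statement about the marks is to compare the mean with the median and to flag unusual scores. The sketch below assumes the common 1.5 × IQR rule for outliers, which the assignment does not prescribe, and uses only the standard library.

```python
import statistics

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

mean = statistics.mean(scores)
median = statistics.median(scores)

# Quartiles with linear interpolation between data points
q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1

# Assumption: a score is an outlier if it lies more than 1.5 * IQR beyond a quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print(f"Mean: {mean}, Median: {median}")
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Outliers: {outliers}")
```

Under this rule the scores 49 and 56 are flagged, and the mean (41) sits slightly above the median (40.5), which points to a mild right skew caused by those two high scores.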

3) What can you say about the Expected value for the student score?
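
If each of the 18 recorded scores is assumed to be equally likely (an assumption, since the question gives no probabilities), the expected value is the probability-weighted sum of the scores, which reduces to the arithmetic mean. A minimal sketch:

```python
from fractions import Fraction

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

# Assumption: every recorded score has the same probability 1/n
probability = Fraction(1, len(scores))

# E[X] = sum of x * P(x) over all scores
expected_score = sum(probability * s for s in scores)

print(f"Expected value of the score: {float(expected_score)}")
```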

Q3) Three Coins are tossed, find the probability that two heads and one tail are obtained.

import numpy as np

num_trials = 1000000 # You can adjust this number based on desired accuracy

results = np.random.randint(0, 2, size=(num_trials, 3)) # 0 represents Tail, 1 represents Head

# Exactly two heads among three coins means exactly one tail, in any position
count_two_heads_one_tail = np.sum(results.sum(axis=1) == 2)

probability = count_two_heads_one_tail / num_trials

print(f"Probability of getting two heads and one tail: {probability}")
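
The simulation only approximates the answer. As a cross-check, the eight equally likely outcomes can be enumerated directly; this sketch assumes fair, independent coins.

```python
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

# Outcomes with exactly two heads (and therefore one tail)
favourable = [o for o in outcomes if o.count("H") == 2]

probability = Fraction(len(favourable), len(outcomes))
print(f"Exact probability of two heads and one tail: {probability} = {float(probability)}")
```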

Q4) Two Dice are rolled, find the probability that the sum is
a) Equal to 1
b) Less than or equal to 4
c) Sum is divisible by 2 and 3

import random

def roll_dice():
    """Simulates rolling two dice and returns their sum."""
    return random.randint(1, 6) + random.randint(1, 6)

def main():
    simulations = 10000

    count_equal_to_1 = 0
    count_less_than_or_equal_to_4 = 0
    count_divisible_by_2_and_3 = 0

    for _ in range(simulations):
        total = roll_dice()  # named total to avoid shadowing the built-in sum()
        if total == 1:  # never true: the smallest possible sum is 2
            count_equal_to_1 += 1
        if total <= 4:
            count_less_than_or_equal_to_4 += 1
        if total % 2 == 0 and total % 3 == 0:
            count_divisible_by_2_and_3 += 1

    # Estimate each probability as the fraction of simulations in which the event occurred
    probability_equal_to_1 = count_equal_to_1 / simulations
    probability_less_than_or_equal_to_4 = count_less_than_or_equal_to_4 / simulations
    probability_divisible_by_2_and_3 = count_divisible_by_2_and_3 / simulations

    print("a) Equal to 1:", probability_equal_to_1)
    print("b) Less than or equal to 4:", probability_less_than_or_equal_to_4)
    print("c) Sum is divisible by 2 and 3:", probability_divisible_by_2_and_3)

if __name__ == "__main__":
    main()

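
Because two dice have only 36 equally likely outcomes, the exact probabilities can also be obtained by enumeration. This is a sketch that assumes fair dice; the helper name `probability` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (die1, die2) pairs
rolls = list(product(range(1, 7), repeat=2))

def probability(event):
    """Exact probability that the sum of two dice satisfies `event`."""
    favourable = sum(1 for a, b in rolls if event(a + b))
    return Fraction(favourable, len(rolls))

p_equal_1 = probability(lambda s: s == 1)        # impossible: the minimum sum is 2
p_at_most_4 = probability(lambda s: s <= 4)      # sums 2, 3, and 4
p_div_2_and_3 = probability(lambda s: s % 6 == 0)  # divisible by 2 and 3: sums 6 and 12

print("a) Equal to 1:", p_equal_1)
print("b) Less than or equal to 4:", p_at_most_4)
print("c) Sum is divisible by 2 and 3:", p_div_2_and_3)
```
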
Q5) A bag contains 2 red, 3 green, and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?

import math

total_red = 2
total_green = 3
total_blue = 2
total_balls = total_red + total_green + total_blue

ways_no_blue = math.comb(total_red + total_green, 2)

total_ways = math.comb(total_balls, 2)

probability_no_blue = ways_no_blue / total_ways

print(f"The probability that none of the balls drawn is blue is {probability_no_blue:.4f}")

Q6) Calculate the Expected number of candies for a randomly selected child:
Below are the probabilities of the count of candies for children (ignoring the nature of the
child; generalized view)
i. Child A – the probability of having 1 candy is 0.015.
ii. Child B – the probability of having 4 candies is 0.2.

CHILD   Candies count   Probability

A       1               0.015
B       4               0.20
C       3               0.65
D       5               0.005
E       6               0.01
F       2               0.12
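
The expected number of candies is the sum of each count multiplied by its probability. The following sketch takes the values from the table above; the dictionary layout is only an illustrative choice.

```python
import math

# child -> (candies count, probability), copied from the table
candies = {
    "A": (1, 0.015),
    "B": (4, 0.20),
    "C": (3, 0.65),
    "D": (5, 0.005),
    "E": (6, 0.01),
    "F": (2, 0.12),
}

# Sanity check: the probabilities should add up to 1
total_probability = sum(p for _, p in candies.values())
assert math.isclose(total_probability, 1.0)

# E[X] = sum of count * probability
expected_candies = sum(count * p for count, p in candies.values())

print(f"Expected number of candies: {expected_candies:.2f}")
```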

Q7) Calculate Mean, Median, Mode, Variance, Standard Deviation, and Range, and comment
on the values / draw inferences, for the given dataset.
- For Points, Score, Weigh

Dataset: Refer to Hands-on Material in LMS - Data Types EDA assignment. A snapshot of the
dataset is given above.
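
The dataset itself is not reproduced here, so the sketch below shows only the shape of the calculation. The numbers in `toy_data` are hypothetical placeholders, not the LMS data; the column names come from the question, and `summarize` is an illustrative helper.

```python
import statistics

def summarize(values):
    """Return the measures requested in Q7 for one column of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.multimode(values),      # list, since a column may have several modes
        "variance": statistics.variance(values),   # sample variance (n - 1 in the denominator)
        "std_dev": statistics.stdev(values),       # sample standard deviation
        "range": max(values) - min(values),
    }

# Hypothetical placeholder values, NOT the actual dataset
toy_data = {
    "Points": [3.0, 3.5, 4.0, 3.5],
    "Score": [2.0, 3.0, 3.0, 4.0],
    "Weigh": [16.0, 18.0, 18.0, 20.0],
}

summary = {column: summarize(values) for column, values in toy_data.items()}

for column, measures in summary.items():
    print(column, measures)
```

With the real data loaded (for example through pandas), the same function can be applied to each column's list of values.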

Q8) Calculate the Expected Value for the problem below.

a) The weights (X) of patients at a clinic (in pounds), are.
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the Expected Value of the
Weight of that patient?

weights = [108, 110, 123, 134, 135, 145, 167, 187, 199]

# Number of patients
n = len(weights)

# Calculate the sum of weights

sum_weights = sum(weights)

# Calculate the Expected Value

expected_value = sum_weights / n

# Print the Expected Value
print(f"The Expected Value of the weight of a randomly chosen patient is: {expected_value:.2f} pounds")

Q9) Look at the data given below. Plot the data, find the outliers, and find out: μ, σ, σ²
Hint: [Use a plot that shows the data distribution, and skewness along with the outliers; also
use Python code to evaluate measures of centrality and spread]

Name of company Measure X

Allied Signal 24.23%
Bankers Trust 25.53%
General Mills 25.41%
ITT Industries 24.14%
J.P.Morgan & Co. 29.62%
Lehman Brothers 28.25%
Marriott 25.81%
MCI 24.39%
Merrill Lynch 40.26%
Microsoft 32.95%
Morgan Stanley 91.36%
Sun Microsystems 25.99%
Travelers 39.42%
US Airways 26.71%
Warner-Lambert 35.00%

import matplotlib.pyplot as plt

import numpy as np

# Company names and their Measure X values

companies = [
"Allied Signal", "Bankers Trust", "General Mills", "ITT Industries",
"J.P.Morgan & Co.", "Lehman Brothers", "Marriott", "MCI",

"Merrill Lynch", "Microsoft", "Morgan Stanley", "Sun Microsystems",
"Travelers", "US Airways", "Warner-Lambert"
]

measure_x = [
24.23, 25.53, 25.41, 24.14, 29.62, 28.25, 25.81, 24.39,
40.26, 32.95, 91.36, 25.99, 39.42, 26.71, 35.00
]

# Convert measure_x to numpy array for easier statistical calculations

measure_x = np.array(measure_x)

# Plot boxplot to identify outliers

plt.figure(figsize=(10, 6))
plt.boxplot(measure_x, vert=False)
plt.yticks([])
plt.title('Boxplot of Measure X')
plt.xlabel('Measure X (%)')
plt.grid(True)
plt.show()

# Calculate mean, standard deviation, and variance

mean = np.mean(measure_x)
std_dev = np.std(measure_x)
variance = np.var(measure_x)

print(f"Mean (μ): {mean:.2f}%")

print(f"Standard Deviation (σ): {std_dev:.2f}%")
print(f"Variance (σ^2): {variance:.2f} (squared percentage points)")
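
The boxplot shows outliers visually, but the question also asks to find them. A minimal sketch, assuming the 1.5 × IQR rule (matplotlib's default whisker length), that names the companies falling outside the fences:

```python
import statistics

# Company -> Measure X (%), copied from the table in the question
measure_x = {
    "Allied Signal": 24.23, "Bankers Trust": 25.53, "General Mills": 25.41,
    "ITT Industries": 24.14, "J.P.Morgan & Co.": 29.62, "Lehman Brothers": 28.25,
    "Marriott": 25.81, "MCI": 24.39, "Merrill Lynch": 40.26, "Microsoft": 32.95,
    "Morgan Stanley": 91.36, "Sun Microsystems": 25.99, "Travelers": 39.42,
    "US Airways": 26.71, "Warner-Lambert": 35.00,
}

# Quartiles with linear interpolation between data points
q1, _, q3 = statistics.quantiles(list(measure_x.values()), n=4, method="inclusive")
iqr = q3 - q1

# Assumption: values beyond 1.5 * IQR from the quartiles count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [(name, x) for name, x in measure_x.items()
            if x < lower_fence or x > upper_fence]

print(f"Q1: {q1:.3f}, Q3: {q3:.3f}, IQR: {iqr:.3f}")
print(f"Fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"Outliers: {outliers}")
```

Under this rule only Morgan Stanley (91.36%) is flagged, and that single value pulls the mean and standard deviation well above what the other 14 companies would give.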

Q10) AT&T was running commercials in 1990 aimed at luring back customers who had switched
to one of the other long-distance phone service providers. One such commercial shows a
businessman trying to reach Phoenix and mistakenly getting Fiji, where a half-naked native on a
beach responds incomprehensibly in Polynesian. When asked about this advertisement, AT&T
admitted that the portrayed incident did not actually take place but added that this was an
enactment of something that “could happen.” Suppose that one in 200 long-distance telephone
calls is misdirected.

What is the probability that at least one in five attempted telephone calls reaches the wrong
number? (Assume independence of attempts.)
Hint: [Using the Probability formula evaluate the probability of one call being wrong out of five
attempted calls]

import numpy as np

# Parameters
p_misdirected = 1 / 200  # Probability that a single call is misdirected
num_calls = 5            # Number of calls attempted in each trial

# Exact answer via the complement rule: P(at least one) = 1 - P(none)
exact_probability = 1 - (1 - p_misdirected) ** num_calls

# Simulate many trials of five independent calls each
num_trials = 1000000
misdirected_per_trial = np.random.binomial(num_calls, p_misdirected, num_trials)

# Fraction of trials in which at least one call was misdirected
simulated_probability = np.mean(misdirected_per_trial >= 1)

# Print results
print(f"Probability of at least one misdirected call in {num_calls} calls (exact): {exact_probability:.4f}")
print(f"Probability of at least one misdirected call in {num_calls} calls (simulated): {simulated_probability:.4f}")

Q11) Returns on a certain business venture, to the nearest $1,000, are known to follow the
following probability distribution.
X        P(x)
-2,000   0.1
-1,000   0.1
0        0.2
1,000    0.2
2,000    0.3
3,000    0.1

(i) What is the most likely monetary outcome of the business venture?
Hint: [The outcome is most likely the expected returns of the venture]

(ii) Is the venture likely to be successful? Explain.

Hint: [Probability of % of the venture being a successful one]

(iii) What is the long-term average earning of business ventures of this kind? Explain.
Hint: [Here, the expected return to the venture is considered as the
required average]

(iv) What is a good measure of the risk involved in a venture of this kind? Compute
this measure.
Hint: [Risk here stems from the possible variability in the expected returns,
therefore, name the risk measure for this venture]

# Given data

outcomes = [-2000, -1000, 0, 1000, 2000, 3000]

probabilities = [0.1, 0.1, 0.2, 0.2, 0.3, 0.1]

# Calculate the expected value (the hint treats this as the most likely monetary outcome)

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

# Round to the nearest $1,000, matching the granularity of the returns

most_likely_outcome = round(expected_value, -3)

print(f"(i) Most likely monetary outcome: ${most_likely_outcome:,.0f}")

# Probability of a positive return (1000, 2000, or 3000)

success_probability = sum(p for x, p in zip(outcomes, probabilities) if x > 0)

print(f"(ii) Probability of the venture being successful: {success_probability:.2%}")

# The expected value is also the long-term average earning per venture

print(f"(iii) Long-term average earning: ${expected_value:,.0f}")
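
Part (iv) is not covered by the code above. Following the hint that risk stems from the variability of returns, a reasonable sketch is to use the standard deviation of the distribution as the risk measure; the choice of standard deviation over variance is an assumption here.

```python
import math

outcomes = [-2000, -1000, 0, 1000, 2000, 3000]
probabilities = [0.1, 0.1, 0.2, 0.2, 0.3, 0.1]

# Expected return: E[X] = sum of x * P(x)
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

# Variance: Var(X) = sum of P(x) * (x - E[X])^2
variance = sum(p * (x - expected_value) ** 2 for x, p in zip(outcomes, probabilities))

# Standard deviation, in the same units (dollars) as the returns
std_dev = math.sqrt(variance)

print(f"(iv) Variance of returns: {variance:,.0f}")
print(f"(iv) Standard deviation of returns (risk): ${std_dev:,.2f}")
```

The standard deviation (about $1,470) is large relative to the expected return of $800, which indicates a fairly risky venture.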

Hints:
For each assignment, the solution should be submitted in the below format.
1. Research and Perform all possible steps for obtaining the solution.
2. For Statistics calculations, an explanation of the solutions should be documented in detail
along with codes. Use the same word document to fill in your explanation.
Must follow these guidelines:
2.1 Be thorough with the concepts of Probability, Probability Distributions, Business
Moments, and Univariate & Bivariate visualizations.
2.2 For True/False Questions, or short answer type questions explanation is a must.
2.3 Python code for Univariate Analysis (histogram, box plot, bar plots, etc.) the data
distribution is to be attached.
3. All the codes (executable programs) should execute without errors
4. Code modularization should be followed
5. Each line of code should have comments explaining the logic and why you are using that
function
