2a.

Exploratory Data Analysis


Instructions:
Please share your answers filled in-line in the word document. Submit code
separately wherever applicable.

Please ensure you update all the details:


Name: _____________ Batch ID: ___________
Topic: Exploratory Data Analysis

Guidelines:
1. An assignment submission is considered complete only when correct, executable code is
submitted along with documentation explaining the method and results. A submission missing
either one is invalid and will not be counted as correct.

2. Ensure that you submit your assignments correctly. Resubmission is not allowed.

3. After submission, you can evaluate your work by referring to the keys provided (available
only after submission).

Hints: Follow the CRISP-ML(Q) methodology steps, where appropriate.

1. Data Understanding: work on each feature of the dataset to create a data dictionary as
displayed in the image below:

Make a table as shown above and provide information about the features such as its data
type and its relevance to the model building. And if not relevant, provide reasons and a
description of the feature.

Problem Statements:

© 360DigiTMG. All Rights Reserved.


Q1) Calculate Mean, and Standard Deviation using Python code & draw inferences on the
following data. Refer to the Datasets attachment for the data file.
Hint: [Insights drawn from the data such as data is normally distributed/not, outliers, measures
like mean, median, mode, variance, std. deviation]
a. Car’s speed and distance

import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r"D:\DATA SET\Q1_a.csv")

print("First few rows of the dataset:")
print(data.head())

mean_speed = data['speed'].mean()
std_dev_speed = data['speed'].std()

mean_distance = data['dist'].mean()
std_dev_distance = data['dist'].std()

print("\nStatistics for 'speed':")


print(f"Mean: {mean_speed}")
print(f"Standard Deviation: {std_dev_speed}")

print("\nStatistics for 'distance':")


print(f"Mean: {mean_distance}")
print(f"Standard Deviation: {std_dev_distance}")

b. Top Speed (SP) and Weight (WT)

import pandas as pd

# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r"D:\DATA SET\Q1_b.csv")

print("First few rows of the dataset:")


print(data.head())

mean_sp = data['SP'].mean()
std_dev_sp = data['SP'].std()

mean_wt = data['WT'].mean()
std_dev_wt = data['WT'].std()

print("\nStatistics for 'SP' (Top Speed):")


print(f"Mean: {mean_sp}")
print(f"Standard Deviation: {std_dev_sp}")

print("\nStatistics for 'WT' (Weight):")


print(f"Mean: {mean_wt}")
print(f"Standard Deviation: {std_dev_wt}")
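
The hint for Q1 also asks for inferences such as skewness and outliers, which the mean and standard deviation alone do not show. A minimal sketch of one way to obtain them follows. The helper name `describe_column` and the sample values are hypothetical and are not the contents of Q1_a.csv or Q1_b.csv; the 1.5 x IQR outlier rule is an assumed convention.

```python
import pandas as pd

def describe_column(series: pd.Series) -> dict:
    """Return summary statistics and a simple IQR-based outlier count."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = series[(series < lower) | (series > upper)]
    return {
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),    # sample standard deviation (ddof=1)
        "skew": series.skew(),  # > 0 suggests a longer right tail
        "n_outliers": len(outliers),
    }

# Placeholder values for illustration only (not taken from the assignment data)
sample = pd.Series([4, 7, 8, 9, 10, 11, 12, 13, 14, 60])
summary = describe_column(sample)
print(summary)
```

The same helper could be applied to `data['speed']`, `data['dist']`, `data['SP']`, or `data['WT']` once the files are loaded. A mean well above the median together with positive skew hints at right-skewed, non-normal data.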

Q2) Below are the scores obtained by a student on tests.


34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56
1) Find the mean, median and mode, variance, and standard deviation.
import statistics

import numpy as np

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

mean = np.mean(scores)

median = np.median(scores)

# statistics.mode returns the most frequent value and works across Python versions,
# whereas stats.mode(scores)[0][0] fails on recent SciPy releases
mode = statistics.mode(scores)

# Population variance (ddof=0); pass ddof=1 for the sample variance
variance = np.var(scores)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")

2) What can we say about the student marks?


import numpy as np
from scipy import stats

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

std_dev = np.std(scores)

print(f"Standard Deviation: {std_dev}")

3) What can you say about the Expected value for the student score?
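
A minimal sketch for part 3, assuming each of the 18 test scores is equally likely to be picked. Under that assumption the expected value is the probability-weighted sum of the distinct scores, which equals the arithmetic mean. Exact fractions are used here only to avoid floating-point rounding.

```python
from collections import Counter
from fractions import Fraction

scores = [34, 36, 36, 38, 38, 39, 39, 40, 40, 41, 41, 41, 41, 42, 42, 45, 49, 56]

# Empirical probability of each distinct score: count / total (assumption: equal weight per test)
n = len(scores)
pmf = {score: Fraction(count, n) for score, count in Counter(scores).items()}

# E[X] = sum over x of x * P(X = x)
expected_value = sum(score * prob for score, prob in pmf.items())

print(f"Expected value of the score: {float(expected_value)}")
```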

Q3) Three Coins are tossed, find the probability that two heads and one tail are obtained.

import numpy as np

num_trials = 1000000 # You can adjust this number based on desired accuracy

results = np.random.randint(0, 2, size=(num_trials, 3)) # 0 represents Tail, 1 represents Head

# Exactly two heads among three coins implies exactly one tail, in any order
count_two_heads_one_tail = np.sum(results.sum(axis=1) == 2)

probability = count_two_heads_one_tail / num_trials

print(f"Probability of getting two heads and one tail: {probability}")
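
The simulated estimate can be checked against an exact count. The sketch below enumerates all eight equally likely outcomes of three fair coins (fairness is an assumption) and counts those with exactly two heads.

```python
from fractions import Fraction
from itertools import product

# All 2**3 equally likely outcomes, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

# Outcomes with exactly two heads (and therefore one tail)
favourable = [o for o in outcomes if o.count("H") == 2]

exact_probability = Fraction(len(favourable), len(outcomes))
print(f"Exact probability: {exact_probability} = {float(exact_probability)}")
```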

Q4) Two Dice are rolled, find the probability that the sum is
a) Equal to 1
b) Less than or equal to 4
c) Sum is divisible by 2 and 3
import random

def roll_dice():
    """Simulates rolling two dice and returns their sum."""
    return random.randint(1, 6) + random.randint(1, 6)

def main():

    simulations = 10000

    count_equal_to_1 = 0
    count_less_than_or_equal_to_4 = 0
    count_divisible_by_2_and_3 = 0

    for _ in range(simulations):
        total = roll_dice()  # named 'total' to avoid shadowing the built-in sum()
        if total == 1:
            count_equal_to_1 += 1
        if total <= 4:
            count_less_than_or_equal_to_4 += 1
        if total % 2 == 0 and total % 3 == 0:
            count_divisible_by_2_and_3 += 1

    probability_equal_to_1 = count_equal_to_1 / simulations
    probability_less_than_or_equal_to_4 = count_less_than_or_equal_to_4 / simulations
    probability_divisible_by_2_and_3 = count_divisible_by_2_and_3 / simulations

    print("a) Equal to 1:", probability_equal_to_1)
    print("b) Less than or equal to 4:", probability_less_than_or_equal_to_4)
    print("c) Sum is divisible by 2 and 3:", probability_divisible_by_2_and_3)

if __name__ == "__main__":
    main()

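
Because two dice have only 36 equally likely outcomes, the three probabilities can also be computed exactly. The following is a sketch assuming fair six-sided dice; the helper name `probability` is illustrative.

```python
from fractions import Fraction
from itertools import product

# Sums of all 36 equally likely (die1, die2) outcomes
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

def probability(event) -> Fraction:
    """Fraction of outcomes whose sum satisfies the given predicate."""
    return Fraction(sum(1 for s in sums if event(s)), len(sums))

p_equal_1 = probability(lambda s: s == 1)    # impossible: the minimum sum is 2
p_at_most_4 = probability(lambda s: s <= 4)  # sums 2, 3, and 4
p_div_by_2_and_3 = probability(lambda s: s % 2 == 0 and s % 3 == 0)  # sums 6 and 12

print("a) Equal to 1:", p_equal_1)
print("b) Less than or equal to 4:", p_at_most_4)
print("c) Sum is divisible by 2 and 3:", p_div_by_2_and_3)
```
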
Q5) A bag contains 2 red, 3 green, and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?

import math

total_red = 2
total_green = 3
total_blue = 2
total_balls = total_red + total_green + total_blue

ways_no_blue = math.comb(total_red + total_green, 2)

total_ways = math.comb(total_balls, 2)

probability_no_blue = ways_no_blue / total_ways

print(f"The probability that none of the balls drawn is blue is {probability_no_blue:.4f}")

Q6) Calculate the Expected number of candies for a randomly selected child:
Below are the probabilities of the count of candies for children (ignoring the nature of the child-
Generalized view)
i. Child A – the probability of having 1 candy is 0.015.
ii. Child B – the probability of having 4 candies is 0.2.

CHILD    Candies count    Probability
A        1                0.015
B        4                0.20
C        3                0.65
D        5                0.005
E        6                0.01
F        2                0.12
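
A minimal sketch of the calculation for Q6: the expected number of candies is the sum of each candy count multiplied by its probability. The dictionary name `candies` is illustrative; the values are copied from the table.

```python
import math

# child -> (candies count, probability), copied from the table
candies = {
    "A": (1, 0.015),
    "B": (4, 0.20),
    "C": (3, 0.65),
    "D": (5, 0.005),
    "E": (6, 0.01),
    "F": (2, 0.12),
}

# Sanity check: the probabilities should form a valid distribution
total_probability = sum(p for _, p in candies.values())
assert math.isclose(total_probability, 1.0)

# E[X] = sum of count * probability
expected_candies = sum(count * p for count, p in candies.values())
print(f"Expected number of candies: {expected_candies:.2f}")
```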

Q7) Calculate Mean, Median, Mode, Variance, Standard Deviation, and Range & comment
about the values / draw inferences, for the given dataset.
- For Points, Score, Weigh:
Find Mean, Median, Mode, Variance, Standard Deviation, and Range and comment on the
values/ Draw some inferences.

Dataset: Refer to the Hands-on Material in LMS - Data Types EDA. A snapshot of the assignment
dataset is given above.
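
Since the dataset itself is not reproduced here, the following is only a sketch of how the six measures could be computed per column with pandas. The function name `summarize` is hypothetical, and the `demo` values are placeholders for illustration, not the LMS data.

```python
import pandas as pd

def summarize(df: pd.DataFrame, columns: list) -> dict:
    """Compute mean, median, mode, variance, std. deviation, and range per column."""
    result = {}
    for col in columns:
        s = df[col]
        result[col] = {
            "mean": s.mean(),
            "median": s.median(),
            "mode": s.mode().tolist(),  # a column may have more than one mode
            "variance": s.var(),        # sample variance (ddof=1)
            "std_dev": s.std(),         # sample standard deviation (ddof=1)
            "range": s.max() - s.min(),
        }
    return result

# Placeholder rows for illustration only; replace with the dataset from the LMS
demo = pd.DataFrame({
    "Points": [1.0, 2.0, 2.0, 5.0],
    "Score": [2.0, 3.0, 4.0, 5.0],
    "Weigh": [10.0, 12.0, 12.0, 18.0],
})

summary = summarize(demo, ["Points", "Score", "Weigh"])
for column, stats in summary.items():
    print(column, stats)
```

Comparing the mean with the median and looking at the range for each column gives a first indication of skewness and possible outliers.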

Q8) Calculate the Expected Value for the problem below.


a) The weights (X) of patients at a clinic (in pounds) are:
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the Expected Value of the
Weight of that patient?

weights = [108, 110, 123, 134, 135, 145, 167, 187, 199]

# Number of patients
n = len(weights)

# Calculate the sum of weights


sum_weights = sum(weights)

# Calculate the Expected Value


expected_value = sum_weights / n

# Print the Expected Value
print(f"The Expected Value of the weight of a randomly chosen patient is: "
      f"{expected_value:.2f} pounds")

Q9) Look at the data given below. Plot the data, find the outliers, and find out: μ, σ, σ²
Hint: [Use a plot that shows the data distribution, and skewness along with the outliers; also
use Python code to evaluate measures of centrality and spread]

Name of company Measure X


Allied Signal 24.23%
Bankers Trust 25.53%
General Mills 25.41%
ITT Industries 24.14%
J.P.Morgan & Co. 29.62%
Lehman Brothers 28.25%
Marriott 25.81%
MCI 24.39%
Merrill Lynch 40.26%
Microsoft 32.95%
Morgan Stanley 91.36%
Sun Microsystems 25.99%
Travelers 39.42%
US Airways 26.71%
Warner-Lambert 35.00%

import matplotlib.pyplot as plt


import numpy as np

# Company names and their Measure X values


companies = [
"Allied Signal", "Bankers Trust", "General Mills", "ITT Industries",
"J.P.Morgan & Co.", "Lehman Brothers", "Marriott", "MCI",

"Merrill Lynch", "Microsoft", "Morgan Stanley", "Sun Microsystems",
"Travelers", "US Airways", "Warner-Lambert"
]

measure_x = [
24.23, 25.53, 25.41, 24.14, 29.62, 28.25, 25.81, 24.39,
40.26, 32.95, 91.36, 25.99, 39.42, 26.71, 35.00
]

# Convert measure_x to numpy array for easier statistical calculations


measure_x = np.array(measure_x)

# Plot boxplot to identify outliers


plt.figure(figsize=(10, 6))
plt.boxplot(measure_x, vert=False)
plt.yticks([])
plt.title('Boxplot of Measure X')
plt.xlabel('Measure X (%)')
plt.grid(True)
plt.show()

# Calculate mean, standard deviation, and variance


mean = np.mean(measure_x)
std_dev = np.std(measure_x)
variance = np.var(measure_x)

print(f"Mean (μ): {mean:.2f}%")

print(f"Standard Deviation (σ): {std_dev:.2f}%")
# Variance is in squared percentage points, so no % sign is printed
print(f"Variance (σ^2): {variance:.2f}")
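
The boxplot shows outliers visually; to name them, one option is the usual 1.5 x IQR rule, sketched below. That rule is an assumed convention, and other outlier definitions would be equally defensible. The data values are copied from the table in Q9.

```python
import numpy as np

companies = [
    "Allied Signal", "Bankers Trust", "General Mills", "ITT Industries",
    "J.P.Morgan & Co.", "Lehman Brothers", "Marriott", "MCI",
    "Merrill Lynch", "Microsoft", "Morgan Stanley", "Sun Microsystems",
    "Travelers", "US Airways", "Warner-Lambert",
]
measure_x = np.array([
    24.23, 25.53, 25.41, 24.14, 29.62, 28.25, 25.81, 24.39,
    40.26, 32.95, 91.36, 25.99, 39.42, 26.71, 35.00,
])

# Quartiles and interquartile range
q1, q3 = np.percentile(measure_x, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [
    (name, float(value))
    for name, value in zip(companies, measure_x)
    if value < lower_fence or value > upper_fence
]
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Outliers: {outliers}")
```

A single extreme value pulls the mean above the median, which is consistent with a right-skewed distribution.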

Q10) AT&T was running commercials in 1990 aimed at luring back customers who had switched
to one of the other long-distance phone service providers. One such commercial shows a
businessman trying to reach Phoenix and mistakenly getting Fiji, where a half-naked native on a
beach responds incomprehensibly in Polynesian. When asked about this advertisement, AT&T
admitted that the portrayed incident did not actually take place but added that this was an
enactment of something that “could happen.” Suppose that one in 200 long-distance telephone
calls is misdirected.

What is the probability that at least one in five attempted telephone calls reaches the wrong
number? (Assume independence of attempts.)
Hint: [Using the Probability formula evaluate the probability of one call being wrong out of five
attempted calls]

import numpy as np

# Parameters
p_misdirected = 1 / 200  # Probability that a single call is misdirected
calls_per_trial = 5      # Each trial is one batch of five attempted calls

# Simulate many batches of five independent calls;
# each entry is the number of misdirected calls in that batch
num_trials = 1000000
misdirected_counts = np.random.binomial(calls_per_trial, p_misdirected, num_trials)

# Count the batches in which at least one call was misdirected
num_batches_with_error = np.sum(misdirected_counts >= 1)

# Estimate the probability from the simulation
simulated_probability = num_batches_with_error / num_trials

# Print results
print(f"Probability of at least one misdirected call in five (simulated): {simulated_probability:.4f}")
print(f"Batches with at least one misdirected call out of {num_trials}: {num_batches_with_error}")
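
For reference, the complement rule gives a closed-form answer under the stated independence assumption: the probability of at least one misdirected call in five is one minus the probability that all five calls are routed correctly. A minimal sketch:

```python
# Probability that a single call is misdirected (from the problem statement)
p_misdirected = 1 / 200
n_calls = 5

# P(no call misdirected) = (1 - p)^n for independent attempts
p_none_wrong = (1 - p_misdirected) ** n_calls

# P(at least one misdirected) = 1 - P(none misdirected)
p_at_least_one_wrong = 1 - p_none_wrong

print(f"P(at least one of {n_calls} calls is misdirected): {p_at_least_one_wrong:.4f}")
```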

Q11) Returns on a certain business venture, to the nearest $1,000, are known to follow the
following probability distribution.
X P(x)
-2,000 0.1
-1,000 0.1
0 0.2
1000 0.2
2000 0.3
3000 0.1

(i) What is the most likely monetary outcome of the business venture?
Hint: [The outcome is most likely the expected returns of the venture]

(ii) Is the venture likely to be successful? Explain.


Hint: [Probability of % of the venture being a successful one]

(iii) What is the long-term average earning of business ventures of this kind? Explain.
Hint: [Here, the expected return to the venture is considered as the
required average]

(iv) What is a good measure of the risk involved in a venture of this kind? Compute
this measure.
Hint: [Risk here stems from the possible variability in the expected returns,
therefore, name the risk measure for this venture]

# Given data

outcomes = [-2000, -1000, 0, 1000, 2000, 3000]

probabilities = [0.1, 0.1, 0.2, 0.2, 0.3, 0.1]

# Calculate expected value (most likely monetary outcome)

expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

# Round to nearest $1,000

most_likely_outcome = round(expected_value, -3)

print(f"(i) Most likely monetary outcome: ${most_likely_outcome:,}")

# Probabilities of positive returns (1000, 2000, 3000)

success_probability = sum(p for x, p in zip(outcomes, probabilities) if x > 0)

print(f"(ii) Probability of the venture being successful: {success_probability:.2%}")

print(f"(iii) Long-term average earning: ${expected_value:,}")
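
Part (iv) asks for a risk measure. One common choice, in line with the hint about variability of returns, is the standard deviation of the returns around their expected value. The following is a sketch under that assumption; other risk measures are possible.

```python
import math

outcomes = [-2000, -1000, 0, 1000, 2000, 3000]
probabilities = [0.1, 0.1, 0.2, 0.2, 0.3, 0.1]

# Expected return: E[X] = sum of x * P(x)
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

# Variance: Var(X) = sum of P(x) * (x - E[X])^2
variance = sum(p * (x - expected_value) ** 2 for x, p in zip(outcomes, probabilities))

# Standard deviation is in dollars, the same unit as the returns
std_dev = math.sqrt(variance)

print(f"(iv) Variance of returns: {variance:,.0f}")
print(f"(iv) Standard deviation of returns (risk): ${std_dev:,.2f}")
```

A standard deviation that is large relative to the expected return indicates that individual outcomes can differ widely from the long-term average.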

Hints:
For each assignment, the solution should be submitted in the below format.
1. Research and Perform all possible steps for obtaining the solution.
2. For Statistics calculations, an explanation of the solutions should be documented in detail
along with codes. Use the same word document to fill in your explanation.
Must follow these guidelines:
2.1 Be thorough with the concepts of Probability, Probability Distributions, Business
Moments, and Univariate & Bivariate visualizations.
2.2 For True/False Questions, or short answer type questions explanation is a must.
2.3 Python code for Univariate Analysis (histogram, box plot, bar plots, etc.) showing the data
distribution is to be attached.
3. All the codes (executable programs) should execute without errors.
4. Code modularization should be followed.
5. Each line of code should have comments explaining the logic and why you are using that
function.
