
ADITHYA INSTITUTE OF TECHNOLOGY

COIMBATORE – 641 107

DEPARTMENT OF ARTIFICIAL INTELLIGENCE & DATA SCIENCE

AD3301 – DATA EXPLORATION AND VISUALIZATION LABORATORY

Regulation 2021
Academic Year : 2024-2025 (ODD Semester)
Year / Sem : II / 03
BONAFIDE CERTIFICATE

Certified that this is the Bonafide Record of work done by

Mr/Ms........................................................ Register Number..........................................of

.........semester in the Department of Artificial Intelligence & Data Science during the

academic year...................................

Place :

Date :

Staff-In-Charge Head of the Department

Submitted for the University Practical Examination held on ............................................

Internal Examiner External Examiner


PRACTICAL EXERCISES:

1. Install the data analysis and visualization tool: Python.


2. Perform exploratory data analysis (EDA) on datasets such as an email data set. Export all your emails as a
dataset, import them into a pandas data frame, visualize them and get different insights from the data.
3. Working with NumPy arrays, Pandas data frames, basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample
data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform data analysis and representation on a map using various map data sets with mouse rollover
effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world;
states and districts in India, etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and present an
analysis report.
LIST OF EXPERIMENTS
S.NO EXPERIMENTS PAGE NO MARKS SIGNATURE

EX NO: 1
DATE: INSTALLING DATA ANALYSIS AND VISUALIZATION TOOL

AIM:
To write the steps to install the data analysis and visualization tool Python (via the Anaconda distribution).

PROCEDURE:

Anaconda is open-source software that bundles Jupyter, Spyder, and other tools used for large-scale data
processing, data analytics, and heavy scientific computing. Conda is a package and environment management
system that is available across Windows, Linux, and macOS, similar to pip. It helps in the installation of
packages and dependencies associated with a specific language such as Python, C++, Java, or Scala. Conda is
also an environment manager and helps to switch between different environments with just a few commands.
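For example, creating and switching to a separate environment takes only a few commands (the environment name myenv and the Python version here are illustrative):
conda create -n myenv python=3.11
conda activate myenv
conda deactivate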
Installing Conda on Windows:
Follow the below steps to install conda on windows:
Step 1: Visit the Anaconda download page (https://www.anaconda.com/download) and download the Anaconda installer.
Step 2: Click on the downloaded .exe file and click on Next.

Step 3: Agree to the terms and conditions.


Step 4: Select the installation type.

Step 5: Choose the installation location.


Step 6: Now check the checkbox to add Anaconda to your environment Path and click Install.

This will start the installation.


Step 7: After the installation is complete you’ll get the following message, here click on Next.

Step 8: You’ll get the following screen once the installation is ready to be used. Here click on Finish.
Verifying the installation:
Now open the Anaconda PowerShell Prompt and use the command below to check the conda version:
conda -V
If conda is installed successfully, the prompt prints the installed version, for example conda 24.1.2 (the exact number depends on the installer).

Result:
Thus the data analysis and visualization tool Python has been installed successfully.
Ex no: 2
Date: Exploratory Data Analysis (EDA) on Datasets

Aim:
To perform exploratory data analysis (EDA) on datasets such as an email data set.
Procedure:
Exploratory Data Analysis (EDA) on email datasets involves importing the data, cleaning it, visualizing
it, and extracting insights. Here is a step-by-step guide for performing EDA on an email dataset using
Python and Pandas.
1. Import Necessary Libraries:
Import the required Python libraries for data analysis and visualization.
2. Load Email Data:
Assuming you have a folder containing email files (e.g., .eml files), you can use the email library to
parse and extract the email contents (a minimal parsing sketch follows this procedure).
3. Data Cleaning:
Depending on your dataset, you may need to clean and preprocess the data. Common cleaning
steps include handling missing values, converting dates to datetime format, and removing duplicates.
4. Data Exploration:
Now, you can start exploring the dataset using various techniques. Here are some common EDA tasks:
Basic Statistics:
Get summary statistics of the dataset.
Distribution of Dates:
Visualize the distribution of email dates.
5. Word Cloud for Subject or Message:
Create a word cloud to visualize common words in email subjects or messages.
6. Top Senders and Recipients:
Find the top email senders and recipients.
Depending on your dataset, you can explore further, analyze sentiment, perform network analysis, or
any other relevant analysis to gain insights from your email data.
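Step 2 above mentions parsing raw .eml files; below is a minimal sketch using Python's standard email library, assuming a hypothetical folder emails/ of exported .eml files (the sample program that follows instead loads an already-exported CSV):
import os
import pandas as pd
from email import policy
from email.parser import BytesParser
rows = []
for name in os.listdir('emails'): # hypothetical folder of .eml files
    if name.endswith('.eml'):
        with open(os.path.join('emails', name), 'rb') as f:
            msg = BytesParser(policy=policy.default).parse(f)
        rows.append({'from': msg['From'], 'to': msg['To'],
                     'date': msg['Date'], 'subject': msg['Subject']})
df = pd.DataFrame(rows)
df['date'] = pd.to_datetime(df['date'], errors='coerce') # step 3: convert dates to datetime
print(df.head())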
Program:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv(r'D:\ARCHANA\dxv\LAB\DXV\Emaildataset.csv') # raw string avoids backslash-escape issues in Windows paths
# Display basic information about the dataset
print(df.info())
# Display the first few rows of the dataset
print(df.head())
# Descriptive statistics
print(df.describe())
# Check for missing values
print(df.isnull().sum())
# Visualize the distribution of numerical variables
sns.pairplot(df)
plt.show()
# Visualize the distribution of categorical variables
sns.countplot(x='label', data=df)
plt.show()
# Correlation matrix for numerical variables (numeric_only avoids errors on the text columns in recent pandas)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
# Word cloud for text data (this dataset's text column is named 'text', as the output below shows)
from wordcloud import WordCloud
text_data = ' '.join(df['text'])
wordcloud = WordCloud(width=800, height=400, random_state=21,
max_font_size=110).generate(text_data)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

OUTPUT:
Data columns (total 4 columns):
# Column Non-Null Count Dtype

0 Unnamed: 0 5171 non-null int64


1 label 5171 non-null object
2 text 5171 non-null object
3 label_num 5171 non-null int64
dtypes: int64(2), object(2)
memory usage: 161.7+ KB
None
Unnamed: 0 label text label_num
0 605 ham Subject: enron methanol ; meter # : 988291\r\n... 0
1 2349 ham Subject: hpl nom for january 9 , 2001\r\n( see... 0
2 3624 ham Subject: neon retreat\r\nho ho ho , we ' re ar... 0
3 4685 spam Subject: photoshop , windows , office . cheap ... 1
4 2030 ham Subject: re : indian springs\r\nthis deal is t... 0
Unnamed: 0 label_num
count 5171.000000 5171.000000
mean 2585.000000 0.289886
std 1492.883452 0.453753
min 0.000000 0.000000
25% 1292.500000 0.000000
50% 2585.000000 0.000000
75% 3877.500000 1.000000
max 5170.000000 1.000000
Unnamed: 0 0
label 0
text 0
label_num 0
dtype: int64
Result:
Thus exploratory data analysis (EDA) on an email data set has been performed successfully.
Ex no: 03
Date: Working with NumPy arrays, Pandas data frames, basic plots using Matplotlib

Aim:
Write the steps for working with NumPy arrays, Pandas data frames, and basic plots using Matplotlib.
Procedure:
1. NumPy:
NumPy is a fundamental library for numerical computing in Python. It provides support for multi-dimensional
arrays and various mathematical functions. To get started, you'll first need to install NumPy if
you haven't already (you can use pip):

pip install numpy

Once NumPy is installed, you can use it as follows:


import numpy as np
# Creating NumPy arrays
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr)
# Basic operations
mean = np.mean(arr)
total = np.sum(arr) # 'total' avoids shadowing the built-in sum()
print("Mean:", mean, "Sum:", total)
# Mathematical functions
print("Square roots:", np.sqrt(arr))
print("Exponentials:", np.exp(arr))
# Indexing and slicing
first_element = arr[0]
sub_array = arr[1:4]
print("First element:", first_element)
print("Slice [1:4]:", sub_array)
# Array operations
combined_array = np.concatenate([arr, sub_array])
print("Concatenated:", combined_array)
OUTPUT:
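Expected output (the values are deterministic for this array; numeric spacing may differ slightly):
Array: [1 2 3 4 5]
Mean: 3.0 Sum: 15
Square roots: [1. 1.41421356 1.73205081 2. 2.23606798]
Exponentials: [ 2.71828183 7.3890561 20.08553692 54.59815003 148.4131591]
First element: 1
Slice [1:4]: [2 3 4]
Concatenated: [1 2 3 4 5 2 3 4]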
2. Pandas:
Pandas is a powerful library for data manipulation and analysis.
You can install Pandas using pip:
pip install pandas
Here's how to work with Pandas DataFrames:
import pandas as pd

# Creating a DataFrame from a dictionary


data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Age': [25, 30, 35, 28, 22],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami']
}

df = pd.DataFrame(data)
# Display the entire DataFrame
print("DataFrame:")
print(df)
# Accessing specific columns
print("\nAccessing 'Name' column:")
print(df['Name'])
# Adding a new column
df['Salary'] = [50000, 60000, 75000, 48000, 55000]
# Filtering data
print("\nPeople older than 30:")
print(df[df['Age'] > 30])
# Sorting by a column
print("\nSorting by 'Age' in descending order:")
print(df.sort_values(by='Age', ascending=False))
# Aggregating data
print("\nAverage age:")
print(df['Age'].mean())
# Grouping and aggregation
grouped_data = df.groupby('City')['Salary'].mean()
print("\nAverage salary by city:")
print(grouped_data)
# Applying a function to a column
df['Age_Squared'] = df['Age'].apply(lambda x: x ** 2)
# Removing a column
df = df.drop(columns=['Age_Squared'])
# Saving the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
# Reading a CSV file into a DataFrame
new_df = pd.read_csv('output.csv')
print("\nDataFrame from CSV file:")
print(new_df)
OUTPUT:
3. Matplotlib:

Matplotlib is a popular library for creating static, animated, or interactive plots and graphs.
Install Matplotlib using pip:
pip install matplotlib
Here's a simple example of creating a basic plot:
import numpy as np # needed for np.linspace and np.sin below
import matplotlib.pyplot as plt
# Sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a line plot
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='Sine Wave')
plt.title('Sine Wave Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.grid(True)
plt.show()
OUTPUT:
RESULT:
Thus working with NumPy, Pandas, and Matplotlib has been completed successfully.
Ex no:4
Date: Exploring various variable and row filters in R for cleaning data
Aim:
Exploring various variable and row filters in R for cleaning data.
PROCEDURE:
Data Preparation and Cleaning
First, let's create a sample dataset and then explore various variable and row filters to clean the data.

# Create a sample dataset


set.seed(123)
data <- data.frame(
ID = 1:10,
Age = sample(18:60, 10, replace = TRUE),
Gender = sample(c("Male", "Female"), 10, replace = TRUE),
Score = sample(1:100, 10)
)
# Print the sample data
print(data)
OUTPUT:

Variable Filters
1. Filtering by a Specific Value:
To filter rows based on a specific value in a variable (e.g., only show rows where Age is greater than
30):
filtered_data <- data[data$Age > 30, ]
2. Filtering by Multiple Conditions:
You can filter rows based on multiple conditions using the & (AND) or | (OR) operators (e.g., show
rows where Age is greater than 30 and Gender is "Male"):
filtered_data <- data[data$Age > 30 & data$Gender == "Male", ]
Row Filters
1. Removing Duplicate Rows:
To remove duplicate rows based on certain columns (e.g., remove duplicates based on 'ID'):
cleaned_data <- unique(data[, c("ID", "Age", "Gender")])
2. Removing Rows with Missing Values:
To remove rows with missing values (NA):
cleaned_data <- na.omit(data)
Data Visualization
1. Apply various plot features using the ggplot2 package to visualize the cleaned data.
# Load the ggplot2 package
library(ggplot2)
# Create a scatterplot of Age vs. Score with points colored by Gender
ggplot(data = cleaned_data, aes(x = Age, y = Score, color = Gender)) +
  geom_point() +
  labs(title = "Scatterplot of Age vs. Score",
       x = "Age",
       y = "Score")
# Create a histogram of Age
ggplot(data = cleaned_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.5) +
  labs(title = "Histogram of Age",
       x = "Age",
       y = "Frequency")
# Create a bar chart of Gender distribution
ggplot(data = cleaned_data, aes(x = Gender)) +
  geom_bar(fill = "green", alpha = 0.7) +
  labs(title = "Gender Distribution",
       x = "Gender",
       y = "Count")

RESULT:
Thus various variable and row filters in R for cleaning data have been explored and the sample data visualized successfully.
EX NO: 5 PERFORM EDA ON WINE QUALITY DATA SET
DATE:
AIM:
To write a program to Perform EDA on Wine Quality Data Set.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset (replace "pathname" with the CSV path; note that the UCI
# Wine Quality files are semicolon-separated, in which case pass sep=';')
data = pd.read_csv("pathname")
# Display the first few rows of the dataset
print(data.head())
# Get information about the dataset
print(data.info())
# Summary statistics
print(data.describe())
# Distribution of wine quality
sns.countplot(x='quality', data=data) # keyword arguments are required by recent seaborn versions
plt.title("Wine Quality Data Set")
plt.show()
# Box plots for selected features by wine quality
features = ['alcohol', 'volatile acidity', 'citric acid', 'residual sugar']
for feature in features:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='quality', y=feature, data=data)
    plt.title(f'{feature} by Wine Quality')
    plt.show()
# Pair plot of selected features
sns.pairplot(data, vars=['alcohol', 'volatile acidity', 'citric acid', 'residual sugar'],
hue='quality', diag_kind='kde')
plt.suptitle("Pair Plot of Selected Features")
plt.show()
# Correlation heatmap
corr_matrix = data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
# Histograms of selected features
features = ['alcohol', 'volatile acidity', 'citric acid', 'residual sugar']
for feature in features:
    plt.figure(figsize=(6, 4))
    sns.histplot(data[feature], kde=True, bins=20)
    plt.title(f"Distribution of {feature}")
    plt.show()
OUTPUT:
RESULT:
Thus the program to perform EDA on the Wine Quality data set has been executed successfully.
EX NO:6
DATE: TIME SERIES ANALYSIS USING VARIOUS VISUALIZATION
TECHNIQUES
AIM:
To perform time series analysis and apply the various visualization techniques.

DOWNLOADING DATASET:
Step 1: Open a web browser, visit the address below, and download a dataset (e.g., one of the daily time series CSV files):
http://github.com/jbrownlee/Datasets
Step 2: write the following code to get the details.
from pandas import read_csv
from matplotlib import pyplot
# header=0, index_col=0, parse_dates=True assume the first column holds dates,
# as in the daily time series files in the repository above
series = read_csv('pathname', header=0, index_col=0, parse_dates=True)
print(series.head())
series.plot()
pyplot.show()

OUTPUT:
Step 3: To get the time series line plot:
series.plot(style='-.')
pyplot.show()

Step 4:
To create a Histogram:
series.hist()
pyplot.show()
Step 5:
To create density plot:
series.plot(kind='kde')
pyplot.show()
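Step 6 (an optional addition): pandas also provides lag and autocorrelation plots for time series, applied here to the same series object:
from pandas.plotting import lag_plot, autocorrelation_plot
values = series.squeeze() # the single data column as a Series
lag_plot(values) # scatter of each observation against the previous one
pyplot.show()
autocorrelation_plot(values) # correlation of the series with lagged copies of itself
pyplot.show()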

Result:
Thus time series analysis has been performed and checked with various visualization techniques.
EX NO: 7
DATE: DATA ANALYSIS AND REPRESENTATION ON A MAP

AIM:
Write a program to perform data analysis and representation on a map using various map data sets
with mouse rollover effect, user interaction.
PROCEDURE:
STEP 1:
• Make sure to install the necessary libraries.
pip install geopandas folium bokeh
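The program below assumes that geographic.csv has at least Latitude, Longitude, and Info columns, for example (illustrative rows, not the actual file):
Latitude,Longitude,Info
11.0168,76.9558,Coimbatore
13.0827,80.2707,Chennai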
PROGRAM:
from bokeh.io import show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
from bokeh.layouts import column
import pandas as pd
import folium
# Load your data
data = pd.read_csv(r'D:\ARCHANA\dxv\LAB\DXV\geographic.csv') # raw string avoids backslash-escape issues in Windows paths
# Create a Bokeh figure
p = figure(width=800, height=400, tools='pan,wheel_zoom,reset')
# Create a ColumnDataSource to hold data
source = ColumnDataSource(data)
# Add circle markers to the figure
p.circle(x='Longitude', y='Latitude', size=10, source=source, color='orange')
# Create a hover tool for mouse rollover effect
hover = HoverTool()
hover.tooltips = [("Info", "@Info"), ("Latitude", "@Latitude"), ("Longitude",
"@Longitude")]
p.add_tools(hover)
# Display the Bokeh plot
layout = column(p)
show(layout)
# Create a map centered on the mean coordinates of the data points
m = folium.Map(location=[data['Latitude'].mean(), data['Longitude'].mean()], zoom_start=10)
# Add markers for your data points
for index, row in data.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=row['Info'], # Display additional info on mouse click
    ).add_to(m)
# Save the map to an HTML file
m.save('map.html')
OUTPUT:

RESULT:
Data analysis and representation on a map using various map data sets with mouse rollover effect,
user interaction has been completed successfully.
EX NO: 8
DATE: BUILDING CARTOGRAPHIC VISUALIZATION

AIM:
Build cartographic visualization for multiple datasets involving various countries of the world,
states and districts in India, etc.
PROCEDURE:
STEP 1:
Collect Datasets
Gather the datasets containing geographical information for countries, states, or districts. Make sure these
datasets include the necessary attributes for mapping (e.g., country/state/district names, codes, and
relevant data).
STEP 2:
Install Required Libraries:
pip install geopandas matplotlib
STEP 3:
Load Geographic Data:
Use Geopandas to load the geographic data for countries, states, or districts. Make sure to match the
geographical data with your datasets based on the common attributes.
STEP 4:
Merge Datasets:
Merge your datasets with the geographic data based on common attributes. This step is crucial for linking
your data to the corresponding geographic regions.
STEP 5:
Create Cartographic Visualizations:
Use Matplotlib to create cartographic visualizations. You can create separate plots for different datasets
or overlay them on a single map (a choropleth sketch is given after the sample program below).
STEP 6:
Customize and Enhance:
Customize your visualizations based on your needs. You can add legends, labels, titles, and other
elements to enhance the interpretability of your maps.
STEP 7:
Save and Share:
Save your visualizations as image files or interactive plots if needed. You can then share these
visualizations with others.
PROGRAM:
import pandas as pd
import geopandas as gpd
import shapely
# needs 'descartes'
import matplotlib.pyplot as plt
df = pd.DataFrame({'city': ['Berlin', 'Paris', 'Munich'],
                   'latitude': [52.518611111111, 48.856666666667, 48.137222222222],
                   'longitude': [13.408333333333, 2.3516666666667, 11.575555555556]})
gdf = gpd.GeoDataFrame(df.drop(['latitude', 'longitude'], axis=1),
                       crs='EPSG:4326', # modern string form; {'init': ...} is deprecated
                       geometry=[shapely.geometry.Point(xy)
                                 for xy in zip(df.longitude, df.latitude)])
print(gdf)
# gpd.datasets was removed in GeoPandas 1.0; on newer versions download the
# Natural Earth low-resolution countries file and pass its path to read_file
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
base = world.plot(color='white', edgecolor='black')
gdf.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()

OUTPUT:
city geometry
0 Berlin POINT (13.40833 52.51861)
1 Paris POINT (2.35167 48.85667)
2 Munich POINT (11.57556 48.13722)
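The sample program above plots point locations. A minimal sketch of the merge-and-map workflow from Steps 4 and 5, assuming the same naturalearth_lowres data (its pop_est and iso_a3 columns ship with that dataset), could look like this:
import geopandas as gpd
import matplotlib.pyplot as plt
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# Your own table would be joined first on a shared key, e.g.
# world = world.merge(my_df, left_on='iso_a3', right_on='country_code') # my_df is hypothetical
# Choropleth colored by an attribute column
world.plot(column='pop_est', legend=True, cmap='OrRd', figsize=(12, 6))
plt.title('Estimated Population by Country')
plt.show()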
RESULT:
Thus cartographic visualization for multiple datasets involving various countries of the world
has been built and visualized successfully.
EX NO :9
DATE: VISUALIZING VARIOUS EDA TECHNIQUES AS CASE STUDY FOR
IRIS DATASET
AIM:
Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
PROCEDURE:
Import Libraries:
Start by importing the necessary libraries and loading the dataset.
Descriptive Statistics:
Compute and display descriptive statistics.
Check for Missing Values:
Verify if there are any missing values in the dataset.
Visualize Data Distributions:
Visualize the distribution of numerical variables.
Correlation Heatmap:
Examine the correlation between numerical variables.
Boxplots for Categorical Variables:
Use boxplots to visualize the distribution of features by species.
Violin Plots:
Combine box plots with kernel density estimation for better visualization.
Correlation between Features:
Visualize pair-wise feature correlations.
Conclusion and Summary:
Summarize key findings and insights from the analysis.
This case study provides a comprehensive analysis of the Iris dataset, including data exploration,
descriptive statistics, visualization of data distributions, correlation analysis, and feature-specific
visualizations.
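PROGRAM:
A minimal sketch of this case study, assuming the Iris dataset bundled with seaborn (column names such as petal_length come from that copy of the data):
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset('iris')
# Descriptive statistics and missing value check
print(iris.describe())
print(iris.isnull().sum())
# Distributions and pair-wise feature correlations
sns.pairplot(iris, hue='species')
plt.show()
# Correlation heatmap of the numerical variables
sns.heatmap(iris.drop(columns='species').corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Boxplot and violin plot of a feature by species
sns.boxplot(x='species', y='petal_length', data=iris)
plt.show()
sns.violinplot(x='species', y='petal_length', data=iris)
plt.show()
RESULT:
Thus the various EDA and visualization techniques have been applied to the Iris dataset and an analysis report presented successfully.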
