Data Mining Journal 4 Kashan


Bahria University, Karachi Campus

LAB EXPERIMENT NO. 4

LIST OF TASKS
TASK NO.  OBJECTIVE

1  Using Python, implement the Decision Tree algorithm on the Diabetes dataset to predict the chances of diabetes in a person. Visualize the results of the model in the form of a confusion matrix using matplotlib and seaborn.

2  Using KNIME, implement Task # 01.

3  Using Python, perform parameter tuning to optimize the Decision Tree performance and compare the results with Task # 1.

Date: ___________

Kashan Riaz 02-131212-075 Data mining Journal


Task No. 1: Diabetes dataset decision tree in Python
Solution:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the dataset
data = pd.read_csv("diabetes.csv")

# Display the first few rows of the dataset
print(data.head())

# Splitting the data into features (X) and target variable (y)
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initializing the Decision Tree classifier
clf = DecisionTreeClassifier(criterion='entropy', splitter='best')

# Training the classifier
clf.fit(X_train, y_train)

# Predicting on the test set
y_pred = clf.predict(X_test)

# Calculating accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Creating a confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plotting the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()
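
As an optional extension (not part of the original task), a classification report can complement the accuracy and confusion matrix with per-class precision and recall. The snippet below is a minimal sketch that assumes y_test and y_pred from the code above are still in scope, and that Outcome values 0/1 correspond to "no diabetes"/"diabetes".

# Optional: per-class precision and recall to complement the confusion matrix.
# Assumes y_test and y_pred from the code above, and that Outcome 0 = no
# diabetes and 1 = diabetes.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=["No diabetes", "Diabetes"]))
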
Task No. 2: KNIME workflow
Solution:
 Preprocessing and decision trees.
 Data cleaning, data transformation, etc.
 The correlation matrix shows which columns are most strongly correlated.
 Decision trees can be built in KNIME.
 The following workflow is used for decision trees (a rough Python equivalent is sketched after this list).


 Assign different colors to the class values using the Color Manager node.


 Keep the partitioning at 80 percent (80% train, 20% test).

 The Scorer node view shows the accuracy.


 Decision Tree Learner view.
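
For illustration only, the following is a rough Python sketch of what the KNIME workflow above does (correlation view, 80/20 partitioning, Decision Tree Learner/Predictor, Scorer); it is not part of the KNIME task and assumes the same diabetes.csv file used in Task 1.

# Illustration only: rough Python equivalent of the KNIME workflow above.
# Assumes the same "diabetes.csv" file used in Task 1.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = pd.read_csv("diabetes.csv")

# Correlation matrix: shows which columns are most correlated
plt.figure(figsize=(8, 6))
sns.heatmap(data.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

# Partitioning: 80% train, 20% test
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision Tree Learner + Predictor
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Scorer: accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
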





Task No. 3: Optimize the decision tree
Solution:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Load the dataset
data = pd.read_csv("diabetes.csv")

# Splitting the data into features (X) and target variable (y)
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Splitting the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid to search
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5)

# Perform GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Predicting on the test set using the best model
best_clf = grid_search.best_estimator_
y_pred_tuned = best_clf.predict(X_test)



# Calculating accuracy
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print("Accuracy (Tuned Model):", accuracy_tuned)

# Creating a confusion matrix for the tuned model
cm_tuned = confusion_matrix(y_test, y_pred_tuned)

# Plotting the confusion matrix for the tuned model
plt.figure(figsize=(8, 6))
sns.heatmap(cm_tuned, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix (Tuned Model)')
plt.show()

The accuracy increased from about 75% to 78% after hyperparameter tuning.
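
To make the comparison with Task 1 explicit, one possible addition (assuming the variables from the Task 3 code above are still in scope) is to retrain a default tree on the same split and print both accuracies together; note that the exact numbers can vary slightly between runs because the Task 1 tree does not fix random_state.

# Optional comparison sketch: baseline (Task 1 settings) vs. tuned model.
# Assumes X_train, X_test, y_train, y_test, accuracy_tuned and grid_search
# from the code above are still defined.
baseline_clf = DecisionTreeClassifier(criterion='entropy', splitter='best')
baseline_clf.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline_clf.predict(X_test))

print("Accuracy (Baseline Model):", baseline_accuracy)
print("Accuracy (Tuned Model):", accuracy_tuned)
print("Best cross-validation score during tuning:", grid_search.best_score_)
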

