Vardha DS
INDEX
SR. NO.   DATE       PARTICULAR                      SIGNATURE
1         11/11/22   Data Preparation
2         17/11/22   Principal Component Analysis
5         2/12/22    Clustering
6         9/12/22    Association
Description
Dataset –
Chi-Square distribution
Degrees of freedom
Role/Importance
The Chi-square test is intended to test how likely it is that an observed
distribution is due to chance. It is also called a "goodness of fit"
statistic, because it measures how well the observed distribution of data
fits the distribution that is expected if the variables are
independent.
Dataset
Code:
data_frame <-
read.csv("C:/Users/student/Desktop/JanhaviTYSem6/treatment.csv")
#Reading CSV
table(data_frame$treatment, data_frame$improvement)
chisq.test(data_frame$treatment, data_frame$improvement)
Output:
Interpretation
p-value: 0.2796
The p-value is greater than 0.05.
Conclusion
Since the p-value (0.2796) is greater than the 0.05 significance level, we fail to reject the null hypothesis: the treatment and improvement variables are independent.
PRACTICAL 2: PRINCIPAL COMPONENT ANALYSIS
#Plot PCA
library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(mtcars.pca)
ggbiplot(mtcars.pca, labels=rownames(mtcars))
ggbiplot(mtcars.pca,ellipse=TRUE,choices=c(3,4),
labels=rownames(mtcars), groups=mtcars.country)
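The plotting calls above assume that mtcars.pca and mtcars.country already exist from an earlier step that is not shown here. A minimal sketch of how they are typically built (the chosen columns and the country assignment are assumptions for illustration):
# Assumed setup for the biplots above: PCA on the continuous mtcars columns
mtcars.pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)
# Assumed grouping variable: country of origin for each car (illustrative only)
mtcars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7),
                    rep("US", 3), "Europe", rep("Japan", 3), rep("US", 4),
                    rep("Europe", 3), "US", rep("Europe", 3))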
Output:
Conclusion:
Europe and US origin cars have higher variance compared to Japan origin cars.
There is separation between American and Japanese cars along a principal
component that is closely correlated with cyl, disp, wt and mpg.
These variables can be considered the ones that best separate the two groups.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 3: EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis refers to the critical process of performing
initial investigations on data so as to discover patterns, spot
anomalies, test hypotheses and check assumptions with the help
of summary statistics and graphical representations.
At a high level, EDA is the practice of describing the data by means
of statistical and visualization techniques to bring important
aspects of that data into focus for further analysis.
This involves looking at your data set from many angles,
describing it, and summarizing it without making any assumptions
about its contents.
This is a significant step to take before diving into machine
learning or statistical modeling, to make sure the data are really
what they are claimed to be and that there are no obvious
problems.
Diabetes Dataset
Data set
The datasets consist of several medical predictor (independent)
variables and one target (dependent) variable, Outcome. Independent
variables include the number of pregnancies the patient has had, their
BMI, insulin level, age, and so on.
Data Description
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral
glucose tolerance test
Blood Pressure: Diastolic blood pressure (mm Hg)
Skin Thickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
Diabetes Pedigree Function: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1); 1 indicates diabetes is present
Program
diabet <- read.csv('C:/Users/student/Downloads/diabetes.csv')
head(diabet)
str(diabet)
summary(diabet)
# Display only 10 values from whole data
diabet[1:10,]
#Find the columns of the dataset and check for missing values in the dataset
names(diabet)
colSums(is.na(diabet))
#Draw histogram
hist(diabet$BMI,col='RED')
#Do boxplot
boxplot(diabet$BMI)
#Find mean, median , max and min
mean(diabet$BMI)
median(diabet$BMI)
max(diabet$BMI)
min(diabet$BMI)
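# 'count' used below is not defined earlier in this listing; an assumed
# reconstruction (hypothetical: frequency table of the Outcome class)
count <- table(diabet$Outcome)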
barplot(count)
pie(count)
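# 'newdata7' used below is not defined earlier in this listing; an assumed
# reconstruction with a hypothetical condition, for illustration only
newdata7 <- subset(diabet, diabet$Pregnancies >= 8 & diabet$Outcome == 1)
newdata7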
#Draw a bar chart and boxplot of BMI and Glucose for the above subset
barplot(rbind(newdata7$BMI, newdata7$Glucose), beside=TRUE,
col='YELLOW')
boxplot(newdata7$BMI, newdata7$Glucose,
col='YELLOW')
newdata8 <- subset(diabet, diabet$Pregnancies >= 8 | diabet$Outcome == 1)
newdata8
#Draw a bar chart and boxplot of BMI and Glucose for the above subset
barplot(rbind(newdata8$BMI, newdata8$Glucose), beside=TRUE,
col='YELLOW')
boxplot(newdata8$BMI, newdata8$Glucose,
col='YELLOW')
Conclusion –
1. BMI does not affect diabetes much.
2. Blood pressure level does not affect diabetes.
3. Number of pregnancies affects diabetes levels.
Valorant Dataset
Data set
This dataset contains various stats about the game's weapons like
damage, price, fire rate, etc.
Program
valorant <-read.csv('C:/Users/student/Downloads/valorant-stats.csv')
head(valorant)
str(valorant)
summary(valorant)
valorant[1:10, ]
valorant[, 1:2]
valorant[1:10, 1:2]
newdata1 <- subset(valorant, valorant$Magazine.Capacity >= 12)
newdata1
newdata2 <- subset(valorant, valorant$Weapon.Type == "Rifle" &
valorant$Wall.Penetration == "Medium")
newdata2
newdata3 <- subset(valorant, valorant$Fire.Rate >= 5 | valorant$Price
>= 500, select=c(1, 2))
newdata3
names(valorant)
colSums(is.na(valorant))
hist(valorant$BDMG_1 ,col='RED')
plot(valorant$BDMG_1)
boxplot(valorant$BDMG_1)
mean(valorant$BDMG_1)
median(valorant$BDMG_1)
max(valorant$BDMG_1)
min(valorant$BDMG_1)
class(valorant$Magazine.Capacity)
table(valorant$Wall.Penetration)
count <- table(valorant$Wall.Penetration)
barplot(count,col=2)
pie(count)
table(valorant$Magazine.Capacity)
count <- table(valorant$Magazine.Capacity)
barplot(count)
pie(count)
Conclusion –
1. There are more guns with medium wall penetration.
2. The most common magazine capacities are 12 and 30.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 4: DECISION TREE
Diabetes dataset
Code:
install.packages("partykit")
install.packages("caret",type="win.binary")
install.packages("pROC",type="win.binary")
install.packages('rattle',type="win.binary")
install.packages('rpart.plot',type="win.binary")
install.packages('data.table',type="win.binary")
titanic<-read.csv(file.choose(),header = T,sep=",") #the diabetes CSV is read into a variable named 'titanic'
summary(titanic)
names(titanic)
library(partykit)
titanic$Outcome<-as.factor(titanic$Outcome)#convert to categorical
summary(titanic$Outcome)
names(titanic)
set.seed(1234)
pd<-sample(2,nrow(titanic),replace = TRUE, prob=c(0.8,0.2))#two
samples with distribution 0.8 and 0.2
trainingset<-titanic[pd==1,]#first partition
validationset<-titanic[pd==2,]#second partition
tree<-ctree(formula = Outcome ~ Pregnancies + Glucose +
BloodPressure + SkinThickness + Insulin + BMI +
DiabetesPedigreeFunction + Age ,data=trainingset)
class(titanic$Outcome)
plot(tree)
#Pruning
tree<-ctree(formula = Outcome ~ Pregnancies + Glucose +
BloodPressure + SkinThickness + Insulin + BMI +
DiabetesPedigreeFunction +
Age ,data=trainingset,control=ctree_control(mincriterion =
0.99,minsplit = 500))
plot(tree)
pred<-predict(tree,validationset,type="prob")
pred
pred<-predict(tree,validationset)
pred
library(caret)
confusionMatrix(pred,validationset$Outcome)
pred<-predict(tree,validationset,type="prob")
pred
library(pROC)
plot(roc(validationset$Outcome,pred[ ,2]))
library(rpart)
fit <- rpart(Outcome ~ Pregnancies + Glucose + BloodPressure +
SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
Age ,data=titanic,method="class")
plot(fit)
text(fit)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(fit)
Prediction <- predict(fit, titanic, type = "class")
Prediction
Output:
Conclusion:
Accuracy: 69.75%
Green nodes in the tree correspond to one Outcome class and blue nodes to the
other; darker shades indicate a higher proportion of that class in the node.
The ROC curve shows that the model is not very accurate, as sensitivity and
specificity are almost the same.
The true positive rate is not high enough, so accuracy is only moderate.
Specificity and sensitivity should be greater than 80% for good accuracy.
partykit: A Toolkit for Recursive Partitioning
A toolkit with infrastructure for representing, summarizing, and
visualizing tree-structured regression and classification models. This
unified infrastructure can be used for reading/coercing tree models
from different sources ('rpart', 'RWeka', 'PMML'), yielding objects that
share functionality for print()/plot()/predict() methods.
Caret:
The caret package (short for Classification and Regression Training)
contains functions to streamline the model training process for complex
regression and classification problems.
pROC
pROC is a set of tools to visualize, smooth and compare receiver
operating characteristic (ROC) curves. (Partial) area under the curve
(AUC) can be compared with statistical tests based on U-statistics or
bootstrap. Confidence intervals can be computed for (p)AUC or ROC
curves.
Rattle
A package written in R providing a graphical user interface to very many
other R packages that provide functionality for data mining.
Data.table
data.table is an extension of the data.frame in R. It is widely used
for fast aggregation of large datasets, low-latency
add/update/remove of columns, quicker ordered joins, and a fast file
reader.
rpart.plot
This function combines and extends plot.rpart and text.rpart in the
rpart package. It automatically scales and adjusts the displayed tree for
best fit.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 5: CLUSTERING
1. Import dataset
data<-read.csv("C:/Users/student/Desktop/4823_JanhaviTYSem6/
ds/coordinate.csv")
data<-data[1:150,]
names(data)
For Y
5. Calculate WSS
data<-new_data #'new_data' is assumed to come from an earlier data-preparation step not shown here
wss<-sapply(1:15, function(k){kmeans(data, k)$tot.withinss})
wss
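The conclusion of this practical refers to the elbow and silhouette methods, whose plots are not shown in the listing. A minimal sketch of one way to produce them (the cluster package and k = 2 are assumptions):
# Elbow plot: look for the bend in total within-cluster sum of squares
plot(1:15, wss, type = "b",
xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
# Silhouette check for an assumed k = 2 (requires the cluster package)
library(cluster)
km2 <- kmeans(data, centers = 2)
plot(silhouette(km2$cluster, dist(data)))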
library(ggplot2)
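The ggplot call below colours points by clustercut, which is not defined in this listing. A minimal sketch of how such a vector is typically produced with hierarchical clustering (the linkage method and k = 2 are assumptions):
# Assumed reconstruction: hierarchical clustering cut into 2 clusters
dist_mat <- dist(data, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
clustercut <- cutree(hclust_avg, k = 2)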
ggplot(data, aes(X,y)) +
geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clustercut) +
scale_color_manual (values = c("red", "green","black","blue"))
12. DBSCAN clustering
library(fpc)
data_1<-data[-5]
set.seed(220)
Dbscan_cl<-dbscan(data_1,eps=0.45,MinPts = 5)
Dbscan_cl$cluster
table(Dbscan_cl$cluster , data$X)
plot(Dbscan_cl , data_1 , main="DBScan")
plot(Dbscan_cl , data_1 , main = "X vs Y")
Conclusion:
As we can see from the box plots of both the x and y features, no outliers
are present, so no additional feature construction is needed.
Both the elbow method and the silhouette method give 2 clusters as the
optimal number, so we build two clusters.
We constructed clusters using the k-means, hierarchical and DBSCAN clustering
methods; the k-means clusters give a better representation of the data than
the other two.
The clusters are formed with the x feature as the prime aspect. As the figure
below shows, the second cluster forms as the x value increases; both clusters
contain high y values, but the first cluster has lower x values than the
second, so the x feature is the prime aspect for clustering.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 6: ASSOCIATION
Association:
Association is a data mining technique that discovers the probability of
the co-occurrence of items in a collection. The relationships between
co-occurring items are expressed as Association Rules. Association rule
mining finds interesting associations and relationships among large sets
of data items. Association rules are "if-then" statements that help to
show the probability of relationships between data items within large
data sets in various types of databases. Here the "if" element is called
the antecedent and the "then" statement is called the consequent.
Relationships in which we find an association or relation
between two items are known as single cardinality. Association rule
mining has a number of applications and is widely used to help discover
sales correlations in transactional data or in medical data sets.
Apriori:
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for
finding frequent itemsets in a dataset for Boolean association rules.
The algorithm is named Apriori because it uses prior knowledge of
frequent itemset properties. It applies an iterative, level-wise
search in which k-frequent itemsets are used to find (k+1)-itemsets.
To improve the efficiency of the level-wise generation of frequent itemsets,
an important property called the Apriori property is used, which
reduces the search space.
Apriori Property – All non-empty subsets of a frequent itemset must also be
frequent.
Limitations of the Apriori Algorithm
The Apriori algorithm can be slow.
Its main limitation is the time required to hold a vast number of candidate
sets when there are many frequent itemsets, a low minimum support or large
itemsets, i.e. it is not an efficient approach for large datasets.
It checks many sets from the candidate itemsets and scans the database
repeatedly to find the candidate itemsets. Apriori becomes very slow and
inefficient when memory capacity is limited and the number of transactions is
large.
Algorithm
Calculate the support of item sets (of size k = 1) in the
transactional database (note that support is the frequency of
occurrence of an itemset). This is called generating the candidate
set.
Prune the candidate set by eliminating items with a support less
than the given threshold.
Join the frequent itemsets to form sets of size k + 1, and repeat
the above steps until no more itemsets can be formed. This will
happen when the set(s) formed have a support less than the given
support.
OR
1. Set a minimum support and confidence.
2. Take all the subset present in the transactions which have higher
support than minimum support.
3. Take all the rules of these subsets which have higher confidence than
minimum confidence.
4. Sort the rules by decreasing lift.
Components of Apriori
Support and Confidence:
Support refers to the frequency of occurrence of items, i.e. how often items
x and y are purchased together; confidence is the conditional probability
that item y is purchased given that item x is purchased.
Support( I )=( Number of transactions containing item I ) / ( Total
number of transactions )
Confidence( I1 -> I2 ) =( Number of transactions containing I1 and I2 ) / (
Number of transactions containing I1 )
Lift:
Lift gives the correlation between A and B in the rule A => B. Correlation
shows how item-set A affects item-set B.
If the rule has a lift of 1, then A and B are independent and no rule can
be derived from them.
If the lift is > 1, then A and B are dependent on each other, and the
degree of dependence is given by the lift value.
If the lift is < 1, then the presence of A has a negative effect on B.
Lift( I1 -> I2 ) = Confidence( I1 -> I2 ) / Support( I2 )
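A small illustrative calculation with assumed numbers: suppose there are 200 transactions in total, 100 contain I1, 40 contain I2, and 25 contain both I1 and I2. Then Support( I1 -> I2 ) = 25/200 = 0.125, Confidence( I1 -> I2 ) = 25/100 = 0.25, and Lift( I1 -> I2 ) = 0.25 / (40/200) = 1.25, so I1 and I2 are positively associated.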
Coverage:
Coverage (also called cover or LHS-support) is the support of the left-
hand side of the rule X => Y, i.e. supp(X).
It measures how often the rule can be applied.
Coverage can be quickly calculated from the rule's quality measures as
Coverage( X -> Y ) = Support( X -> Y ) / Confidence( X -> Y ).
Fp tree:
The FP-Growth algorithm, proposed by Han, is an alternative way to find
frequent item sets without using candidate generation, thus improving
performance. For this purpose, it uses a divide-
and-conquer strategy. The core of this method is the usage of a special
data structure named frequent-pattern tree (FP-tree), which retains the
item set association information. This tree-like structure is made with
the initial itemsets of the database. The purpose of the FP tree is to
mine the most frequent pattern. Each node of the FP tree represents an
item of the itemset.
The root node represents null while the lower nodes represent the
itemsets. The association of the nodes with the lower nodes that is the
itemsets with the other itemsets are maintained while forming the
tree.
Algorithm:
Building the tree
Dataset: supermarket.csv
1. Import libraries
library(arules)
library(arulesViz)
library(RColorBrewer)
2. Import dataset
data<-read.transactions('D:/college/sem_6/data
science/code/supermarket.csv', rm.duplicates= TRUE,
format="single",sep=",",header = TRUE,cols=c("Branch","Product
line"))
#data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/Super
Store.csv')
#data <- subset(data, select = c(0,1))
3. Display structure of data
str(data)
5. Labels of items
data@itemInfo$labels
6. Generating rules
data_rules <- apriori(data, parameter = list(supp = 0.01, conf = 0.2))
data_rules
7. Inspect rules
inspect(data_rules[1:20])
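arulesViz and RColorBrewer are loaded above but not used in the listing; a minimal sketch of how the mined rules could be sorted and visualised (this step is an assumption, not part of the original practical):
# Sort the mined rules by lift and inspect the strongest ones
rules_by_lift <- sort(data_rules, by = "lift", decreasing = TRUE)
inspect(head(rules_by_lift, 5))
# Scatter plot of support vs confidence, shaded by lift (arulesViz)
plot(data_rules, method = "scatterplot",
measure = c("support", "confidence"), shading = "lift")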
PRACTICAL 7: TIME SERIES (ARIMA)
Code:
library(igraph)
data<-read.csv("C:\\Users\\student\\Downloads\\income1.csv")
attach(data)
head(data)
x<-data$Year
y<-data$Value
d.y<-diff(y)
library(ggplot2)
ggplot(data, aes(x,y)) +
geom_point() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust = 1))
#plot(x,y)
acf(y)
pacf(y)
acf(d.y)
arima(y,order = c(1,0,0))
mydata.arima001<-arima(y,order=c(0,0,1))
mydata.pred01<-predict(mydata.arima001,n.ahead = 100)
head(mydata.pred01)
plot(y)
lines(mydata.pred01$pred,col='blue')
attach(mydata.pred01)
tail(mydata.pred01$pred)
head(mydata.pred01$pred)
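As an optional cross-check, the forecast package can choose the ARIMA order automatically; this step is not in the original listing and assumes the forecast package is installed:
# Assumed alternative: let auto.arima pick (p, d, q) and plot a 100-step forecast
library(forecast)
fit_auto <- auto.arima(y)
plot(forecast(fit_auto, h = 100))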
Output:
Conclusion:
As we can see from the blue line showing the predicted trend of the y
parameter, the model predicts values consistent with the observed trend.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 8: MONGODB
> db.student_mark.insert({name:"kunal",marks:[{physics:79},
{chem:89},{bio:87}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Sajjad",marks:[{physics:90},
{chem:79},{bio:84}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Pankaj",marks:[{physics:76},
{chem:89},{bio:67}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Akshay",marks:[{physics:63},
{chem:78},{bio:88}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Yash",marks:[{physics:71},{chem:55},
{bio:65}]})
WriteResult({ "nInserted" : 1 })
Display record in json format
db.student.find().forEach(printjson)
Display students whose age is greater than 22
> db.student.find({age:{$gt:22}}).pretty()
Display details of students whose city is Pune
db.student.find({'address.city':'Pune'}).pretty()
Display students who got more than 84 marks in physics
db.student_mark.find({'marks.physics':{$gt:84}}).pretty()
Display students whose city is Pune or Mumbai and whose age is at least 21
db.student.find({'address.city':{$in:["Pune","mumbai"]},age:
{$gte:21}}).pretty()
Display students who got more than 70 marks in all subject
db.student_mark.find({'marks.bio':{$gte:70},'marks.physics':
{$gte:70},'marks.chem':{$gte:70}}).pretty()
Delete collection
db.student_mark.drop()
Drop database
db.dropDatabase()
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 9: TOPIC MODELLING
Provide a list of stopwords to find in each text file and remove them from
the words in the files.
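The listing below assumes that the tm and topicmodels packages are loaded and that mycorpus has already been built from the text files; a minimal sketch of that setup (the folder path is an assumption):
# Assumed setup: text-mining and topic-modelling packages
library(tm)
library(topicmodels)
# Assumed: build a corpus from a folder of plain-text files (hypothetical path)
mycorpus <- VCorpus(DirSource("textfiles/", pattern = "\\.txt$"))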
mystopwords=c("of","a","and","the","in","to","for","that","is","on","ar
e","with","as","by"
,"be","an","which","it","from","or","can","have","these","has","
such","you")
mycorpus<-tm_map(mycorpus,content_transformer(tolower))
mycorpus<-tm_map(mycorpus,removeWords,mystopwords)
dtm<-DocumentTermMatrix(mycorpus)
k<-3
#lda_output_3<-LDA(dtm,k,method="VEM",control=control_VEM)
# control_VEM
#lda_output_3<-LDA(dtm,k,method="VEM",control=NULL)
lda_output_3<-LDA(dtm,k,method="VEM")
topics(lda_output_3)
terms(lda_output_3,10)
Output:
Conclusion:
The keywords used in all the text files are well suited for natural
language processing (NLP).