Shark Tank - Web and Social Media Analytics Case Study
Project Objective
In this exercise we develop a predictive model that classifies each pitch as deal or no deal, using a dataset from
Shark Tank (a US-based show). The entire exercise is based on text mining.
The dataset contains 495 records from Shark Tank episodes, one for each pitch an entrepreneur made to the
investors. Using social media analytics (text mining) techniques, we predict from the description of a pitch whether
it will result in a deal.
1. Loading dataset
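library(tm)
Sharktank = read.csv("Shark Tank.csv", stringsAsFactors=FALSE)
# Build a corpus from the pitch text (assumes it is stored in a column named "description")
corpus = Corpus(VectorSource(Sharktank$description))
# Convert to lower-case (content_transformer keeps the corpus structure intact)
corpus = tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus = tm_map(corpus, removePunctuation)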
Before further analysis we need to normalize the text. The steps are:
1. Converting all text to lower case
2. Removing punctuation marks and stop words
3. Removing extra white space
4. Stemming the documents
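The four steps above can be sketched in R with the tm package as follows. This is a minimal illustration, not the original analysis code: the two pitch texts are made up, and the English stop-word list is an assumption about which stop words were removed.

```r
library(tm)         # text-mining framework
library(SnowballC)  # stemmer backend used by stemDocument

# Toy pitch descriptions (hypothetical, not taken from the Shark Tank data)
pitches <- c("A Mobile App that delivers fresh, organic meals!",
             "Reusable water bottles for running and hiking.")

corpus <- VCorpus(VectorSource(pitches))

# 1. Lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# 2. Punctuation and stop words (English list assumed)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# 3. Extra white space
corpus <- tm_map(corpus, stripWhitespace)
# 4. Stemming
corpus <- tm_map(corpus, stemDocument)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

The order matters a little: stop words are removed after lower-casing so that capitalised forms such as "A" are matched too.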
For further analysis we use a Document-Term Matrix (DTM), in which each document is a row, each term (word)
is a column, and each cell holds the frequency of that term in that document.
This helps us identify the words used frequently across the corpus.
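To reduce the dimensions of the DTM, we remove the less frequent terms using removeSparseTerms with a
threshold of 0.995. This keeps only terms with sparsity below 0.995, that is, terms appearing in more than
0.5% of the documents (at least 3 of the 495 pitches).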
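As a small sketch of the DTM and the sparse-term filter, the example below uses four made-up documents. Because the toy corpus is so small, a threshold of 0.6 is used instead of 0.995 so that the filter visibly drops something; the document texts and that threshold are assumptions for illustration only.

```r
library(tm)

# Four hypothetical, already-cleaned pitch descriptions
docs <- c("organic food delivery app",
          "food truck app",
          "organic skin care",
          "fitness app")
corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term counts
dtm <- DocumentTermMatrix(corpus)
print(dim(dtm))

# Keep only terms whose sparsity is below 0.6,
# i.e. terms that occur in at least 2 of the 4 documents
sparse_dtm <- removeSparseTerms(dtm, 0.6)
print(Terms(sparse_dtm))
```

With the full dataset the same call with 0.995 would be applied to the DTM built from the cleaned corpus.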
As the final step of data preparation, we convert the DTM into a data frame and add the dependent variable
“deal” to it.
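A minimal sketch of this step, assuming a DTM built with tm. The documents and the deal outcomes are invented, and pitch_df is a hypothetical name.

```r
library(tm)

# Hypothetical cleaned descriptions and their outcomes
docs <- c("organic food app", "food truck", "fitness app")
deal <- c(TRUE, FALSE, TRUE)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# One row per pitch, one column per term
pitch_df <- as.data.frame(as.matrix(dtm))
# make.names guards against terms that are not valid R column names
colnames(pitch_df) <- make.names(colnames(pitch_df))

# Attach the dependent variable
pitch_df$deal <- deal
print(pitch_df)
```

Cleaning the column names up front avoids formula errors later when a term starts with a digit or matches a reserved word.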
To predict whether investors will invest in a business, we use deal as the output variable and build CART,
logistic regression and random forest models, comparing them by accuracy.
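The sketch below shows one way two of these models could be fitted and scored on accuracy. It runs on simulated data rather than the Shark Tank data, the column names app and food are hypothetical term counts, and the random forest is left out to keep the example short; none of this is the original analysis code.

```r
library(rpart)  # CART implementation

set.seed(123)
# Simulated stand-in for the prepared data frame
n <- 200
toy <- data.frame(app = rpois(n, 1), food = rpois(n, 1))
toy$deal <- factor(ifelse(toy$app + rnorm(n) > 1, "TRUE", "FALSE"))

# Accuracy = share of predictions that match the observed outcome
accuracy <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

# CART
cart_fit  <- rpart(deal ~ ., data = toy, method = "class")
cart_pred <- predict(cart_fit, newdata = toy, type = "class")

# Logistic regression, classifying at a 0.5 probability cut-off
logit_fit  <- glm(deal ~ ., data = toy, family = binomial)
logit_prob <- predict(logit_fit, newdata = toy, type = "response")
logit_pred <- ifelse(logit_prob > 0.5, "TRUE", "FALSE")

cart_acc  <- accuracy(toy$deal, cart_pred)
logit_acc <- accuracy(toy$deal, logit_pred)
print(c(CART = cart_acc, Logistic = logit_acc))
```

Accuracy here is computed on the same rows used for fitting, which tends to be optimistic; scoring on a held-out split gives a fairer comparison.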
Evaluating CART Model
c) Building Logistic Regression Model
Next we add a variable “ratio”, derived as the asked-for amount divided by the valuation, and re-run the
models to see whether accuracy improves.
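The ratio can be computed with one line of base R, as sketched below. The column names askedFor and valuation are assumptions about how the dataset labels these fields, and the three rows are made up.

```r
# Hypothetical rows; askedFor and valuation are assumed column names
pitches <- data.frame(askedFor  = c(100000, 250000, 50000),
                      valuation = c(1000000, 1000000, 500000))

# Asked-for amount as a fraction of the stated valuation
pitches$ratio <- pitches$askedFor / pitches$valuation
print(pitches)
```

The new column can then be added to the data frame of term counts before the models are fitted again.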
b) Building Random Forest Model
6. Comparing Model and Conclusion
With the CART model, we achieved accuracies of about 65.65% and 66.06% using description only and
description + ratio respectively.
With random forest, we achieved accuracies of 55.35% and 55.75% using description only and
description + ratio respectively.
Logistic regression gave 100% accuracy with both description only and description + ratio.
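On these figures, logistic regression is the best-performing model for proceeding further with insight analysis.
A perfect score is unusual, however, and may indicate overfitting, so it is worth confirming on held-out data
before relying on it.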