Shark Tank - Web and Social Media Analytics Case Study


A Project on Web and Social Media Analytics

SHYAM KISHORE TRIPATHI
PGP - BABI

Table of Contents

1. Project Objective
2. Defining Business Problem
3. Reading data and performing initial data clean up
4. CART, Logistic Regression and Random Forest Before Ratio
5. CART, Logistic Regression and Random Forest After Ratio
6. Comparing Models and Conclusion

Project Objective
In this exercise we will develop a predictive model to predict deal or no deal using the Shark Tank dataset (from the US-based show). The complete exercise is based on text mining analytics.

Defining Business Problem


Shark Tank is a US-based show in which entrepreneurs and founders pitch their businesses to a panel of investors, who decide whether or not to invest based on multiple parameters.

The dataset contains 495 records of Shark Tank episodes, one for each entrepreneur's pitch to the investors. Using social media analytics (text mining) algorithms, we will predict whether a pitch, given its description, will convert into a deal.

Reading data and performing initial data clean up

1. Loading dataset
Sharktank = read.csv("Shark Tank.csv", stringsAsFactors=FALSE)

2. Loading required libraries


library(wordcloud)
library(tm)
library(SnowballC)
library(rpart)
library(rpart.plot)

3. Performing Initial Clean Up


# Creating corpus
corpus = Corpus(VectorSource(Sharktank$description))

# Convert to lower-case
corpus = tm_map(corpus, content_transformer(tolower))

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

# Word cloud before removing stopwords


wordcloud(corpus,colors=rainbow(7),max.words=100)

Now we need to normalize the text before proceeding with further analysis. The steps to perform are:
1. Convert all text to lower case.
2. Remove punctuation marks and stop words.
3. Remove extra white spaces.
4. Perform stemming of the documents.

Steps 1 and 2 were already applied above; a sketch of the remaining steps follows the list.
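A minimal sketch of the remaining normalization steps, assuming the corpus created above:

# Remove English stop words (common words that carry little meaning)
corpus = tm_map(corpus, removeWords, stopwords("english"))

# Strip extra white space
corpus = tm_map(corpus, stripWhitespace)

# Stem the documents (reduce words to their root form, e.g. "investing" -> "invest")
corpus = tm_map(corpus, stemDocument)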

We then build a DTM (Document-Term Matrix) for further analysis: each document becomes a row, each term/word becomes a column, and each cell holds the frequency of that term in the document.

This helps us identify the unique words that are used frequently across the corpus.

To reduce the dimensionality of the DTM, we remove infrequent words with removeSparseTerms(), keeping only terms with sparsity below 0.995, as sketched below.
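A sketch of the DTM construction, assuming the normalized corpus from the previous step (object names are assumptions):

# Build the Document-Term Matrix (documents as rows, terms as columns)
dtm = DocumentTermMatrix(corpus)

# Keep only terms appearing in at least ~0.5% of documents (sparsity below 0.995)
sparseDTM = removeSparseTerms(dtm, 0.995)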

As the final data-preparation step, we convert this matrix into a data frame and add the dependent variable "deal" to it, as sketched below.
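A minimal sketch, assuming the sparse DTM and the Sharktank data frame created above (the name SharkDF is an assumption):

# Convert the sparse DTM into a data frame
SharkDF = as.data.frame(as.matrix(sparseDTM))

# Make column names syntactically valid for modelling formulas
colnames(SharkDF) = make.names(colnames(SharkDF))

# Add the dependent variable as a factor
SharkDF$deal = as.factor(Sharktank$deal)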

4. CART, Logistic Regression and Random Forest Before Ratio

To predict whether investors will invest in a business, we use deal as the output variable, build CART, logistic regression and random forest models, and measure the performance and accuracy of each model.

a) Building CART Model
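A minimal sketch of the CART model, assuming the prepared data frame SharkDF from the previous step:

# Build a classification tree predicting deal from all term frequencies
CARTmodel = rpart(deal ~ ., data = SharkDF, method = "class")

# Plot the tree
prp(CARTmodel)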

Evaluating CART Model
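A sketch of the evaluation, computing accuracy from the confusion matrix on the same data (object names are assumptions):

# Predict classes and build the confusion matrix
predictCART = predict(CARTmodel, newdata = SharkDF, type = "class")
confMatCART = table(SharkDF$deal, predictCART)

# Accuracy = correctly classified / total
sum(diag(confMatCART)) / sum(confMatCART)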

b) Building Random Forest Model
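A minimal sketch of the random forest model; the seed value and object names are assumptions:

# randomForest is needed in addition to the libraries loaded earlier
library(randomForest)

set.seed(123)

# Build the random forest; predictions without newdata are out-of-bag
RFmodel = randomForest(deal ~ ., data = SharkDF)
predictRF = predict(RFmodel)
confMatRF = table(SharkDF$deal, predictRF)
sum(diag(confMatRF)) / sum(confMatRF)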

c) Building Logistic Regression Model
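A minimal sketch of the logistic regression model, again with object names assumed:

# Fit a logistic regression on all term frequencies
LogModel = glm(deal ~ ., data = SharkDF, family = binomial)

# Predict probabilities and evaluate at a 0.5 cutoff
predictLog = predict(LogModel, type = "response")
confMatLog = table(SharkDF$deal, predictLog > 0.5)
sum(diag(confMatLog)) / sum(confMatLog)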

5. CART, Logistic Regression and Random Forest After Ratio

Now we add an additional variable, "Ratio", derived as the amount asked for divided by the valuation (ask for / valuation), and then re-run the models to see whether accuracy improves. A sketch of this step follows.
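A sketch of the added variable; the column names askedFor and valuation are assumptions about the dataset layout:

# Ratio of the amount asked for to the company valuation
SharkDF$ratio = Sharktank$askedFor / Sharktank$valuation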

a) Building CART Model

b) Building Random Forest Model

c) Building Logistic Regression Model
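The three models are re-run on the augmented data frame in exactly the same way as before; a brief sketch, reusing the objects defined above:

# Re-fit each model with the ratio column included
CARTmodel2 = rpart(deal ~ ., data = SharkDF, method = "class")
RFmodel2   = randomForest(deal ~ ., data = SharkDF)
LogModel2  = glm(deal ~ ., data = SharkDF, family = binomial)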

6. Comparing Models and Conclusion

Action                   CART Model   Logistic Regression Model   Random Forest Model

Before Ratio Accuracy    65.65%       99.79%                      55.35%
After Ratio Accuracy     66.06%       100%                        55.75%

With the CART model, we were able to predict with around 65.65% and 66.06% accuracy using only the description and description + ratio respectively.

With random forest, we were able to predict with 55.35% and 55.75% accuracy using only the description and description + ratio respectively.

With logistic regression, we obtained 99.79% and 100% accuracy using only the description and description + ratio respectively.

From the above analysis we can conclude that logistic regression is the best model to proceed with for further insight analysis.

