G20 - Crowdfunding: Predicting Kickstarter Project Success


Authors

Siddharth Bhandari (19UCS080)
Meet Kumar Jain (19UCS035)
Vidhi Mittal (19UCC108)

Under the guidance of
Dr. Bharavi Mishra, Faculty - Machine Learning

Abstract

Investors usually find it difficult to identify the companies and startups that have high chances of success, as doing so requires manual calculation and analysis. Crowdfunding has emerged as an alternative to this approach: a disruptive innovation for financing a variety of new entrepreneurial ventures without standard financial intermediaries. The aim of the present study is to examine Kickstarter and predict whether a Kickstarter project will succeed or fail in achieving its fundraising goal, using only information available at project launch, such as the fundraising goal, the short description of the project, and the creator description. Using various machine learning algorithms, we evaluate the accuracy and precision with which a project's success can be predicted.

Introduction

Kickstarter is a platform that helps entrepreneurs launch campaigns and "help bring creative projects to life". To be successful, i.e. to reach its required funding goal, a campaign must effectively convey its mission to its targeted backers or audience and get them to choose its project over others. Although anyone can broadcast their creative ideas on the website, not everyone manages to gather enough pledges before the campaign deadline. We try to find out whether it is possible to predict the success or failure of a campaign at the start, using only the information present at the beginning of the campaign.
Crowdfunding has gained popularity over time, with an ever-increasing number of campaigns and investors participating. It was relatively popular from the start, and it has rapidly grown in prominence since then.

Relevant Literature

There have been several studies that leverage machine learning techniques to predict the success of a campaign. Such a predictive model is of great help both for creators and for Kickstarter itself: the knowledge would enable creators to work more on their project and give it sufficient, justified time, while the platform could use it to prioritize projects based on the accuracy and precision of the predictions (it is assumed that the accuracy lies in the range of 60 to 80 percent for a good model).

Dataset and Features

DATA
Projects on Kickstarter combine a variety of data types. There are almost 38 columns, i.e. almost 38 fields for each project, including the fundraising goal, creator details, details of the project or startup, its backstory, categories, photos and videos. We use a well-maintained repository containing data for over 2,50,000 Kickstarter projects. The repository's authors designed a robot to crawl Kickstarter each month and scrape the labelled HTML output into CSV files. This output contains all the information we use for this analysis, including the funding goal, project categories, and a short project blurb. Although the data is stored as CSV, many of these features are stored as JSON strings, which we expand to obtain our desired project variables.
We attempt to select only information that would be available at the time of launch. However, since our data comes from a monthly snapshot, if a creator were to edit their project metadata after launch, we would not be able to detect this. Luckily, many project aspects cannot be changed after launch, including the funding goal. For our project, we use only the dataset for the month of April 2019, which includes more than 2,00,000 entries.

CATEGORY ENCODING
We preprocess our features and create dummy categorical variables before looking into the data further. During preprocessing we deleted duplicate and redundant columns and converted timestamps from UNIX format into a normal date-time format. We also split the countries and states into their subcategories to understand the data more precisely, and divided all projects into parent categories, each further divided into subcategories that we call child categories, which makes it easy for Kickstarter users to see, understand and explore their categories.
As discussed above, each Kickstarter project is associated with a variety of categorical variables, including the project category (e.g. Board Games, Documentary), the parent category (e.g. Gaming, Film), the creator's city and country of origin, as well as the currency of the donation.
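As an illustrative sketch of this preprocessing (the column names category, launched_at, deadline, country and currency are assumptions about the scraped CSV, not values confirmed by the report), the JSON expansion, time conversion and dummy encoding could be done with pandas:

```python
import json
import pandas as pd

# Load one monthly snapshot (the file name is illustrative).
df = pd.read_csv("Kickstarter_2019-04.csv")

# Expand the JSON-encoded 'category' column into parent and child categories;
# the 'slug' field is assumed to look like "games/board games".
cats = df["category"].apply(json.loads)
df["parent_category"] = cats.apply(lambda c: c["slug"].split("/")[0])
df["child_category"] = cats.apply(lambda c: c["slug"].split("/")[-1])

# Convert UNIX timestamps to ordinary datetimes and derive campaign duration.
df["launched_at"] = pd.to_datetime(df["launched_at"], unit="s")
df["deadline"] = pd.to_datetime(df["deadline"], unit="s")
df["duration_days"] = (df["deadline"] - df["launched_at"]).dt.days

# One-hot ("dummy") encode the categorical variables.
df = pd.get_dummies(
    df, columns=["parent_category", "child_category", "country", "currency"]
)
```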
We divided our dataset in the ratio of 70:30. The minority share is reserved for testing, while the remaining 70% of the dataset is used for training, of which 30% (technically 21% of the original dataset) is held out for validation.
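A minimal sketch of this split (X and y stand for the encoded feature matrix and the success label from the preprocessing above; the random seed is arbitrary):

```python
from sklearn.model_selection import train_test_split

# 70% train / 30% test, then 30% of the training data
# (21% of the original dataset) held out for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.30, random_state=42, stratify=y_train
)
```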

Models

In machine learning, we can choose among various models to analyse the projects and estimate, with measured accuracy and precision, whether they will be successful or failed. We start the analysis with Logistic Regression, then move on to Naive Bayes, followed by KNN and several other methods.

LOGISTIC REGRESSION
Logistic Regression uses a sigmoid function, so its output always lies between 0 and 1. We chose Logistic Regression as our baseline model and split the data into 70% training and 30% testing samples, with 30% of the training data used for validation. After fitting the train, validation and test samples into our baseline model, we got the results in the table below:

As the test recall is higher than the test precision, there is some scope for raising our precision score with other models.
The model makes more false-positive predictions than false-negative predictions, which means that more campaigns predicted as successful actually failed than campaigns predicted as failed actually succeeded.
So, we will try the same process with some other models.
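A sketch of this baseline under the splits defined above (scikit-learn's classification_report prints the precision and recall figures summarized here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Baseline model; max_iter is raised so the solver converges on ~200,000 rows.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Precision and recall on the validation and test splits.
print(classification_report(y_val, logreg.predict(X_val)))
print(classification_report(y_test, logreg.predict(X_test)))
```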
BERNOULLI NAIVE BAYES
Bernoulli Naive Bayes predicts membership probabilities for each class, i.e. the probability that a given record or data point belongs to a particular class. This can be expressed mathematically as:
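In standard form, the class posterior combines Bayes' rule with a Bernoulli likelihood for each binary feature:

$$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y), \qquad P(x_i \mid y) = p_{iy}^{\,x_i} (1 - p_{iy})^{\,1 - x_i},$$

where $p_{iy}$ is the estimated probability that feature $i$ equals 1 in class $y$, and the predicted class is the $y$ that maximizes the posterior.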

K-NEAREST NEIGHBOR
The KNN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category most similar to the available categories. After fitting the train, validation and test samples into our KNN model, we got the results in the table below:

As the test recall is higher than the test precision, there is some scope for raising our precision score with other models.

The following table is our confusion matrix for the KNN algorithm:

The model makes more false-positive predictions than false-negative predictions, which means that more campaigns predicted as successful actually failed than campaigns predicted as failed actually succeeded.

So, we will try the same process with some other models.
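A sketch of this step under the same splits (the choice k = 5 is scikit-learn's default, not a value stated in this report):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Assign each project the majority class among its k nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, knn.predict(X_test)))
```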

RANDOM FOREST
Random Forest builds decision trees on different samples and takes their majority vote for classification, or their average in the case of regression.

Precision-Recall Table:

Confusion Matrix:
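A sketch of the corresponding classifier (100 trees is scikit-learn's default and an assumption here, not a setting taken from the report):

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is fitted on a bootstrap sample; the forest takes a
# majority vote over the trees for classification.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))  # accuracy on the test split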

XGBOOST

XGBoost stands for Extreme Gradient Boosting; it is a distributed gradient-boosted decision tree method that provides parallel tree boosting, with the trees created in sequential form. XGBoost, in the form of an equation, can be expressed as:
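Assuming the intended equation is the regularized objective from the XGBoost paper, it can be written as:

$$\mathcal{L} = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

where $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves and $w$ its leaf weights; each new tree is fitted to reduce this objective.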
After applying XGBoost on our dataset, the two images below give our precision-recall table and the confusion matrix for the same.
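A sketch of this step with the xgboost scikit-learn wrapper (the hyperparameter values are illustrative, not those used in the report):

```python
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Trees are added sequentially, each correcting the previous ensemble;
# tree construction itself is parallelized.
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, eval_metric="logloss")
xgb.fit(X_train, y_train)

print(classification_report(y_test, xgb.predict(X_test)))
print(confusion_matrix(y_test, xgb.predict(X_test)))
```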

Results and Discussion

As the studies and analysis of the above algorithms show, we have calculated the accuracy and the precision of all the models on our dataset. The gradient-boosting model (XGBoost) performs the best among all the algorithms, while the remaining models have fairly similar performance, as depicted in the table below:

Model      LR    NB    KNN   RF    XGBoost

F-1        0.74  0.74  0.73  0.74  0.75
Recall     0.84  0.85  0.77  0.78  0.80
Precision  0.66  0.66  0.70  0.70  0.70
Accuracy   0.67  0.66  0.68  0.69  0.69


In studying the confusion matrix, we come across various terms such as Precision, Recall and F-measure, which are defined below:
Another measure of study is the ROC curve, which gives us a plot between the following two parameters: the True Positive Rate and the False Positive Rate.
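A sketch of how such a ROC curve could be produced for one of the fitted models above (here the XGBoost classifier from the earlier sketch):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Scores for the positive ("successful") class from the fitted model.
scores = xgb.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, scores):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```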

It was also found that the goal and the time-related features are the most important features in predicting a campaign's success. This finding makes sense because the success of a campaign is decided by whether the amount pledged meets the goal, and whether that amount is pledged within the campaign deadline.


Limitations
While running the Support Vector Machine (SVM) algorithm, we waited almost half an hour, but the algorithm was still running. Reading some papers and resources, we came across the observation that SVM does not cope well with large data sets, so we decided to skip that algorithm.
The main problem with preprocessing is that we have to do the feature selection manually, as explained in the category encoding section, which requires a lot of manpower when the data grows to millions of records.
While running the XGBoost algorithm, we saw that its performance in predicting different features differed for each subcategory of the same parent category.

Conclusion

According to our results section, we would advise campaigners to:
-> Set a small goal that fits the project scope
-> Keep the campaign duration short
-> Consider the category carefully
in order to increase their chances of launching successful campaigns!

Out of all the algorithms that we studied, we found XGBoost to be the best among all. We usually give preference to Precision, followed by other factors such as Accuracy and Recall. Precision and Accuracy were the same among some of the algorithms, but XGBoost got an edge on the Recall parameter.

Link for the Project and its Files

https://github.com/slundberg/shap
https://drive.google.com/drive/folders/1Y62JNikSS_pu5UQM0PEJYUJ5rjcKiNuK?usp=sharing

References

● https://www.geeksforgeeks.org/libraries-in-python/
● https://ai.stackexchange.com/questions/7202/why-does-training-an-svm-take-so-long-how-can-i-speed-it-up
● https://en.wikipedia.org/wiki/Random_forest
● https://www.datacamp.com/tutorial/xgboost-in-python
● https://stanford.edu/~kartiks2/kickstarter.pdf
● https://github.com/slundberg/shap
