G20 - Crowdfunding: Predicting Kickstarter Project Success
Dataset and Features

DATA
Projects on Kickstarter combine a variety of data types. There are almost 38 columns for each project, covering the fundraising goal, creator details, details and backstory of the project or startup, its categories, and its photos and videos.
We use a well-maintained repository containing data for over 250,000 Kickstarter projects. Its authors designed a robot that crawls Kickstarter each month and scrapes the labeled HTML output into CSV files. This output contains all the information we use for this analysis, including the funding goal, project categories, and a short project blurb. Although the data is stored as CSV, many of the features are stored as JSON strings, which we expand to obtain our desired project variables.
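As a rough illustration, the JSON-encoded columns can be expanded with pandas. The file name and the column names used here (category, creator, location) are assumptions, not the exact schema of the snapshot files.

```python
import json
import pandas as pd

# Load one monthly snapshot; the file name is hypothetical.
df = pd.read_csv("kickstarter_snapshot.csv")

# Columns assumed to hold JSON-encoded strings.
json_columns = ["category", "creator", "location"]

for col in json_columns:
    # Parse each JSON string and flatten it into its own columns,
    # prefixed with the original column name (e.g. "category.name").
    expanded = pd.json_normalize(df[col].apply(json.loads).tolist())
    expanded.columns = [f"{col}.{sub}" for sub in expanded.columns]
    df = pd.concat([df.drop(columns=[col]), expanded], axis=1)
```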
We attempt to select only information that would have been available at the time of launch. However, since our data comes from a monthly snapshot, if a creator edited their project metadata after launch we would not be able to capture that change.

CATEGORY ENCODING
We preprocess our features and dummy-encode the categorical variables before looking into the data further.
During preprocessing we removed duplicate or redundant columns and converted timestamps from their UNIX format into ordinary date-time values. We also split the countries and states into their subcategories to understand the data more precisely, and divided all projects into their parent categories, with each parent category further split into subcategories that we call child categories, which makes it easier for Kickstarter users to see, understand, and explore the categories.
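A minimal sketch of these preprocessing steps, assuming hypothetical column names such as "launched_at", "deadline", "category.slug", and "location.displayable_name"; the real column names in our snapshot may differ.

```python
import pandas as pd

df = pd.read_csv("kickstarter_snapshot.csv")  # hypothetical file name

# Remove duplicate rows and redundant columns (the column list is illustrative).
df = df.drop_duplicates()
df = df.drop(columns=["currency_symbol", "currency_trailing_code"], errors="ignore")

# Convert UNIX timestamps into ordinary datetimes.
for col in ["launched_at", "deadline", "created_at"]:
    df[col] = pd.to_datetime(df[col], unit="s")

# Derive parent and child categories from an assumed "category.slug"
# column such as "games/tabletop games".
df["parent_category"] = df["category.slug"].str.split("/").str[0]
df["child_category"] = df["category.slug"].str.split("/").str[-1]

# Split an assumed "City, State" location string into its two parts.
df[["city", "state"]] = df["location.displayable_name"].str.split(", ", n=1, expand=True)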
As discussed above, each Kickstarter project is associated with a variety of categorical variables, including the project category (e.g. Board Games, Documentary), the parent category (e.g. Gaming, Film), the creator's city and country of origin, and the currency of the donations.
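These categorical variables can then be dummy-encoded, for instance with pandas; the exact column names here are assumptions.

```python
import pandas as pd

categorical_cols = ["parent_category", "child_category", "country", "currency"]

# One-hot (dummy) encode each categorical variable; drop_first avoids
# perfectly collinear indicator columns.
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```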
We split the dataset 70:30. The 30% minority portion is held out for testing, while the remaining 70% is used for training; 30% of that 70% (roughly 21% of the original dataset) is set aside for validation.
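One way to realize this split with scikit-learn, as a sketch; the "success" label column is an assumption.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["success"])   # "success" label column is assumed
y = df["success"]

# 70% train+validation, 30% test.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# 30% of the 70% (about 21% of the full dataset) becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.30, random_state=42, stratify=y_trainval)
```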
Models
K-NEAREST NEIGHBOR
The KNN algorithm assumes similarity between a new case and the available cases and places the new case in the category most similar to the existing ones. After fitting the KNN model on the training set and evaluating it on the validation and test samples, we obtained the results shown in the table below:
Since the test recall is higher than the test precision, there is some scope for raising the precision score with other models.
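A sketch of the K-nearest-neighbor fit and the precision/recall comparison discussed above, reusing the split defined earlier; k=5 is an assumption, not necessarily the value we tuned.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)
print("test precision:", precision_score(y_test, y_pred))
print("test recall:   ", recall_score(y_test, y_pred))
```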
RANDOM FOREST
Random Forest builds decision trees on different samples of the data and takes their majority vote for classification (or their average in the case of regression).
Confusion Matrix:
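A minimal random-forest sketch along the same lines, showing one way such a confusion matrix can be produced; the hyperparameters are illustrative, not our tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Confusion matrix on the held-out test set.
print(confusion_matrix(y_test, rf.predict(X_test)))
```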
XGBOOST
XGBoost, in the form of an equation, can be expressed as:
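Presumably the standard regularized objective of Chen & Guestrin (2016) is meant; as a sketch, with l the training loss, f_k the k-th regression tree, T the number of leaves, and w the leaf weights:

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i),
\qquad
\mathcal{L}(\phi) = \sum_{i} l\!\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^{2}
```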
Model comparison table: LR, NB, KNN, RF, XGBoost.
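A matching XGBoost training sketch for the comparison above, reusing the earlier split; the parameters are illustrative and assume the xgboost Python package.

```python
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score

xgb = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    eval_metric="logloss")
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print("test precision:", precision_score(y_test, y_pred))
print("test recall:   ", recall_score(y_test, y_pred))
```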