Google Cloud and NCAA ML Competition 2019: Women's
Zijing Wang, Xuanyu Liang, Kevin Xue, Xiao Wang, Jiacheng
Shi, Minsheng Liu, and Junxian Tan.
March 25, 2019
1 Introduction
2 Background
3 Data
The data for this study consist of game box scores from 1998 to 2018, covering both regular seasons and tournaments. Detailed box scores are provided for tournaments between 2008 and 2018 and for regular seasons between 1998 and 2018. There are 10 files in the data folder, containing information about the season, location, TeamID, seeds, and team-level box scores. Table 1 lists all fields and their descriptions.
4 Analysis
Field     Description
TeamID    4-digit identification number of each NCAA women's team
TeamName  Name of the team
Season    Year in which the tournament is played
DayZero   The date corresponding to DayNum = 0 of the season
Region    Identifier of the region (W, X, Y, or Z)
Seed      The team's seed within its region
DayNum    The day on which the game took place, counted from DayZero
WScore    Points scored by the winning team
LScore    Points scored by the losing team
NumOT     Number of overtime periods in the game
WLoc      Location of the winning team (H: home, A: away, N: neutral court)
WFGM      Winning team's field goals made
WFGA      Winning team's field goals attempted
WFGM3     Winning team's three-pointers made
WFGA3     Winning team's three-pointers attempted
WFTM      Winning team's free throws made
WFTA      Winning team's free throws attempted
WOR       Winning team's offensive rebounds
WDR       Winning team's defensive rebounds
WAst      Winning team's assists
WTO       Winning team's turnovers committed
WStl      Winning team's steals
WBlk      Winning team's blocks
WPF       Winning team's personal fouls committed
Table 1: All fields and their descriptions.
Figure 1: Histogram of the number of games won at home or away during regular seasons.
In Table 1 and Table 2, the p-values are all less than 0.05. With significance level α = 0.05, we reject the null hypothesis and accept the alternative. Constructing a 95% confidence interval, we are 95% confident that the true home-win rate p is higher than 0.50. Hence, from the one-sided hypothesis test, we conclude that there exists a home advantage over the last decade during both regular seasons and tourneys. The results confirm our assumption that competitive sports exhibit a home advantage. Since the home team has psychological advantages and is more familiar with the home environment, it can pay more attention to the game rather than struggle with travel.
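As a concrete illustration, here is a minimal sketch of this one-sided test using scipy, assuming the game results are loaded into a pandas DataFrame with the WLoc field of Table 1; the function name and column handling are ours, not part of the competition code.

import pandas as pd
from scipy.stats import binomtest

def test_home_advantage(results: pd.DataFrame, alpha: float = 0.05) -> bool:
    # Neutral-court games carry no home/away information, so drop them.
    decided = results[results["WLoc"].isin(["H", "A"])]
    home_wins = int((decided["WLoc"] == "H").sum())
    n = len(decided)
    # H0: p = 0.5 (no home advantage) vs. H1: p > 0.5 (home advantage).
    res = binomtest(home_wins, n, p=0.5, alternative="greater")
    low = res.proportion_ci(confidence_level=0.95).low
    print(f"home-win rate {home_wins / n:.3f}, "
          f"p-value {res.pvalue:.2e}, 95% CI lower bound {low:.3f}")
    return res.pvalue < alpha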
We next test whether the score differences from the regular season follow the same distribution as those from the tourney.
Before the test, we use a bar plot to visually compare the means of the score differences, with the data divided by year, shown in Figure 3. Surprisingly, we do not see much discrepancy between the mean from the regular season and that from the tourney. From 2010 to 2013, as well as in 2016, the mean from the regular season is higher than that from the tourney, but the situation is reversed in the other years.
Figure 3: Bar plot of mean score differences by year, for the regular season and the tourney.
In any case, there is a difference between the regular season and the tourney. We perform a chi-squared test to see whether this difference is significant. The test is done separately for each year; Table 5 shows the test results. As we can see, p > 0.05 for all years, meaning that we cannot reject Hypothesis 4.2, and we are 95% confident that the score differences from the regular season and the tourney follow the same distribution.
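A sketch of this per-year test follows, assuming a long-format DataFrame with Season, Stage, and ScoreDiff columns; the column names and the 5-point bins are our illustrative choices, not necessarily those used in the study.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_by_year(df: pd.DataFrame) -> pd.Series:
    # df columns: Season, Stage ("Regular" or "Tourney"), ScoreDiff.
    pvalues = {}
    for season, group in df.groupby("Season"):
        # Bin the score differences on a common grid for both stages.
        bins = np.arange(0, group["ScoreDiff"].max() + 5, 5)
        table = np.array([
            np.histogram(group.loc[group["Stage"] == s, "ScoreDiff"], bins)[0]
            for s in ("Regular", "Tourney")
        ])
        table = table[:, table.sum(axis=0) > 0]  # drop empty bins
        _, p, _, _ = chi2_contingency(table)
        pvalues[season] = p
    return pd.Series(pvalues, name="p-value")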
The Kolmogorov-Smirnov (K-S) test is a powerful test for normality in general. We set the null hypothesis of the K-S test to be that the distribution is normal. The test gives a p-value less than 0.05, so we reject the null hypothesis: the second distribution is not normal.
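The check can be reproduced roughly as below; since we standardize with the sample mean and standard deviation, the test is only approximate (a Lilliefors-style correction would be more precise).

import numpy as np
from scipy.stats import kstest

def ks_normality(x: np.ndarray, alpha: float = 0.05) -> bool:
    # Standardize, because kstest's "norm" reference has mean 0 and std 1.
    z = (x - x.mean()) / x.std(ddof=1)
    stat, p = kstest(z, "norm")
    print(f"D = {stat:.3f}, p-value = {p:.3g}")
    return p >= alpha  # True means we cannot reject normality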
Figure: Histograms of wins versus net rating, one panel per season (2013-2018 shown; x-axis: Rating, y-axis: Wins).
It is interesting that the net ratings for all teams do not follow a normal distribution. This might be because it is unlikely for a team to have a zero net rating, since a zero net rating means the team scores and allows the same number of points. We can still see that winning teams have higher net ratings than the teams overall.
The last thing we look at is free throws. As we can tell from the graph, there is no significant difference between the winning teams and all teams, but the losing teams do have a heavier left tail and a higher peak.
5 Prediction
After our exploratory data analysis, we are now well-equipped for our ultimate goal: predicting who would win the tourney. In this section, we employ machine learning techniques to solve the problem. We prepare a common training set and evaluate three models: a support vector machine with a linear kernel, a random forest, and AdaBoost.
We focus on predicting the result of any given match in the tourney, given the two participating teams. The final winner can then be predicted by applying our estimated predictor repeatedly, following the tourney rules.
The core idea behind our approach is to build a profile for each team using its performance in the regular season. We assume that there exists a prediction function that takes two teams' profiles as input and outputs the winner. Our work here is to estimate this function.
In the first hypothesis, we tested whether the location of the match has an impact on the match result, and the conclusion was positive. As a result, in addition to the team profiles, we also supply location information to our predictor.
Which information should be used to build the team profile? Machine learning algorithms are good at automatically finding the important features among all available variables. For that to work, however, we need to supply enough information; at the same time, we need to avoid providing useless information.
Our second and third hypotheses show that the match score can fluctuate depending on whether a team is strong or weak. However, any team that made it into the tourney is strong, and the absolute score does not determine who wins and who loses. Therefore, we decide to leave the match score out of the team profile.
Our fourth and fifth hypotheses check whether the advanced metrics give good insight into the result of a match. While information like offensive rating is correlated with the outcome of a match, some others are not.
Data Preparation
We first assign each team in each year a unique identifier, which means each team would occur nine times. Then, we compute the mean, median, standard deviation, and skewness of each advanced metric. There are 13 advanced metrics, so we obtain a 13 × 4 = 52-dimensional vector for each team, serving as the team profile.
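In pandas, the profile construction can be sketched as follows; the metric column names below are placeholders standing in for the 13 advanced metrics.

import pandas as pd

ADVANCED_METRICS = ["OffRating", "DefRating", "NetRating"]  # ... 13 in total

def build_profiles(games: pd.DataFrame) -> pd.DataFrame:
    # One identifier per (Season, TeamID): each team-season is one profile.
    grouped = games.groupby(["Season", "TeamID"])[ADVANCED_METRICS]
    # mean, median, std, and skew of each metric: 13 x 4 = 52 features.
    profiles = grouped.agg(["mean", "median", "std", "skew"])
    profiles.columns = ["_".join(col) for col in profiles.columns]
    return profiles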
Then, for each match result in the given data, we fetch the team profile vectors of both teams. We append to the row a two-dimensional vector denoting which team is the host. It is similar to a one-hot vector, with the slight difference that both dimensions can be zero if the match happens at a neutral site.
Lastly, we generate the labels, namely the match results. Notice that the data are organized so that the first team is always the winning one, which might allow machine learning models to over-fit. Hence, for each match we toss a fair coin to decide whether the winning or the losing team comes first.
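Putting the pieces together, a single training example might be assembled as below; WLoc refers to the winning team's location, as in Table 1, and the function is our illustrative reconstruction.

import numpy as np

rng = np.random.default_rng(seed=0)

def make_example(win_profile: np.ndarray, lose_profile: np.ndarray, wloc: str):
    # Host indicator: (1, 0) if the first team hosts, (0, 1) if the second
    # team hosts, (0, 0) for a neutral court.
    if rng.random() < 0.5:
        # Winning team listed first; label 1 means "first team wins".
        first, second, label = win_profile, lose_profile, 1
        host = {"H": (1, 0), "A": (0, 1), "N": (0, 0)}[wloc]
    else:
        # Losing team listed first.
        first, second, label = lose_profile, win_profile, 0
        host = {"H": (0, 1), "A": (1, 0), "N": (0, 0)}[wloc]
    return np.concatenate([first, second, host]), label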
Results
We reserve 25% of the whole data (about 40k matches) as the validation set. We evaluate three models: a linear support vector machine with a hard margin, a random forest, and AdaBoost. Due to the prohibitively high running cost, we train these models only on a small portion of the training set, but evaluate them on the full validation set. We train on 500, 1000, ..., 7500, 8000 matches for all three models. Figure 14 summarizes the performance of all three models.
Figure 14: Validation accuracy of the three models versus the number of training matches.
The model that performs best is the linear support vector machine, followed by the random forest; the worst-performing one is AdaBoost. The final accuracies of the three models, with a training set of 8000 matches, are 74.6%, 73.1%, and 72.6%, respectively. The performance of all three models increases steadily with more training data. Moreover, from the figure we suspect that we have not yet reached the asymptote: with more computational power we could achieve even better performance.
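The evaluation protocol can be sketched as follows; the hyperparameters are illustrative assumptions, not necessarily the settings used in our experiments.

from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

MODELS = {
    "linear SVM": LinearSVC(C=1e6),  # a very large C approximates a hard margin
    "random forest": RandomForestClassifier(n_estimators=200),
    "AdaBoost": AdaBoostClassifier(n_estimators=200),
}

def learning_curves(X_train, y_train, X_val, y_val):
    # Train on growing subsets, always scoring on the full validation set.
    scores = {name: [] for name in MODELS}
    for n in range(500, 8001, 500):  # 500, 1000, ..., 8000 matches
        for name, model in MODELS.items():
            model.fit(X_train[:n], y_train[:n])
            scores[name].append(model.score(X_val, y_val))
    return scores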
6 Theory
Binomial distribution: the number of successes X in n independent trials, each with success probability p, follows a binomial distribution:
\[ P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \]
MLE of the binomial distribution: For a binomial distribution with parameters n and p, the probability mass function is
\[ P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \]
In order to calculate the MLE, we first derive the likelihood function:
\[ L(p) = \prod_{i=1}^{n} \binom{n}{x_i} p^{x_i} (1-p)^{n-x_i} \]
Then, take the first and second derivatives of the log-likelihood function:
\[ \frac{d\ell}{dp} = \frac{\sum_{i=1}^{n} x_i}{p} - \frac{n - \sum_{i=1}^{n} x_i}{1-p} \]
\[ \frac{d^2\ell}{dp^2} = -\frac{\sum_{i=1}^{n} x_i}{p^2} - \frac{n - \sum_{i=1}^{n} x_i}{(1-p)^2} < 0 \]
At last, set the first derivative equal to zero and derive the MLE:
\[ \hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Offensive Efficiency: to calculate the offensive efficiency, we count the number of points a team scores per 100 possessions.
Defensive Efficiency: to calculate the defensive efficiency, we count the number of points a team allows per 100 possessions.
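For concreteness, here is one common way to compute these ratings; the possession estimate FGA - OR + TO + 0.475 * FTA is a standard convention and an assumption on our part, not necessarily the exact formula used in this study.

def possessions(fga: float, orb: float, to: float, fta: float) -> float:
    # A common possession estimate; the 0.475 free-throw weight is a
    # convention, not necessarily the one used here.
    return fga - orb + to + 0.475 * fta

def offensive_efficiency(points_scored, fga, orb, to, fta):
    # Points scored per 100 possessions.
    return 100.0 * points_scored / possessions(fga, orb, to, fta)

def defensive_efficiency(points_allowed, opp_fga, opp_orb, opp_to, opp_fta):
    # Points allowed per 100 opponent possessions.
    return 100.0 * points_allowed / possessions(opp_fga, opp_orb, opp_to, opp_fta)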
Mean: the mean of a sample is the sum of all the data in the sample divided by the sample size; the mean of X is usually denoted by \( \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i \).
Median: the value located in the middle of the sorted data sample, separating the lower half from the upper half.
Standard Deviation: the standard deviation measures the spread of the data; it is the square root of the average squared distance between the data points and the sample mean.
Variance: the variance is the square of the standard deviation.
Quartile: the first quartile is the cutoff point between the first 25 percent of the data and the remaining 75 percent; the third quartile is the cutoff point between the first 75 percent of the data and the remaining 25 percent.
Linear Regression: linear regression is a linear approach to modeling the relationship between two variables.
Chi-squared test statistic:
\[ X^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i} \]
where \( x_i \) is the observed count and \( m_i \) the expected count in bin \( i \).
Kolmogorov-Smirnov statistic:
\[ D_n = \sup_x |F_n(x) - F(x)| \]
where \( F_n \) is the empirical distribution function and \( F \) the reference distribution.
Support Vector Machine: a support vector machine finds the separating hyperplane with the maximum margin; non-linear decision boundaries can be obtained by implicitly mapping the data into a higher-dimensional space based on some linear algebra, which is called the kernel trick. Common choices are linear, polynomial, and exponential kernels.
AdaBoost: adaptive boosting is a popular machine learning meta-algorithm for classification, regression, and other tasks, which can combine multiple weak classifiers into a single strong classifier. Basically, AdaBoost decides how much weight should be given to each classifier; also, based on the results of the previous classifiers, it re-weights the training set for the next classifier. The following outline is given by Trevor Hastie of Stanford University.
1. Initialize the observation weights \( w_i = 1/N \), \( i = 1, 2, \ldots, N \).
2. For \( m = 1 \) to \( M \):
   a. Fit a classifier \( C_m(x) \) to the training data using the weights \( w_i \).
   b. Compute the weighted error \( \mathrm{err}_m = \sum_{i=1}^{N} w_i \, I(y_i \neq C_m(x_i)) \big/ \sum_{i=1}^{N} w_i \).
   c. Compute
\[ \alpha_m = \log\left[\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right] \]
   d. Update the weights \( w_i \leftarrow w_i \cdot \exp\left[\alpha_m \, I(y_i \neq C_m(x_i))\right] \), \( i = 1, \ldots, N \).
3. Output \( C(x) = \mathrm{sign}\left[\sum_{m=1}^{M} \alpha_m C_m(x)\right] \).
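A from-scratch sketch of this procedure, with labels in {-1, +1} and decision stumps as the weak classifiers, might look like the following; the stump choice and the error clipping are our assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=50):
    # y must contain labels in {-1, +1}.
    N = len(y)
    w = np.full(N, 1.0 / N)                      # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):                           # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # (a) fit a weak classifier
        miss = stump.predict(X) != y
        err = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)  # (b)
        alpha = np.log((1 - err) / err)          # (c) classifier weight
        w = w * np.exp(alpha * miss)             # (d) up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)                        # step 3: weighted vote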
7 Limitations
8 Conclusion
From our exploratory data analysis, we conclude that offensive rating and net rating could affect the chance of winning, while free throw rates and defensive ratings do not seem to affect the winning chance. Building upon this knowledge, we use machine learning to provide a prediction for the 2019 tournament results.
9 Works Cited