AP Statistics Study Guide / Cheat Sheet
Disjoint/Mutually exclusive - events that cannot occur together (both male and female at the
same time)
Residual = Actual - Predicted
Compare distributions
Both distributions of sale prices are centered around 400,000 and are roughly symmetric
(approximately normal is OK here). The distribution of sale prices in Thunder City (range of
about 400,000) is more variable than that of Primeville (range of about 300,000).
The distribution of number of barbell curls is slightly skewed to the right for both groups, with
more skew in the distribution for those who weren’t encouraged. The median number of curls
and the variability in number of curls are both greater for the group that received encouragement.
Neither distribution has any outliers.
The standard deviation is an especially useful measure of variability when the distribution is
normal or approximately normal (see Chapter on Normal Distributions) because the proportion
of the distribution within a given number of standard deviations from the mean can be calculated.
Independence
Because P(home price > $500,000) ≠ P(home price > $500,000 | Primeville), the events “sold in
Primeville” and “sale price over $500,000” are not independent.
Because P(W) ≠ P(W | N), the events ________ and _________ are not independent.
Because P(W) = P( W | N), the events _________ and _______ are independent.
Explain:
NOT INDEPENDENT: When two events are independent, this means that the probability
that one event occurs in no way affects the probability of the other event occurring.
Because these numbers are not equal, the probability that one event occurs does affect
the probability of the other event occurring, and the events are therefore not independent.
INDEPENDENT: When two events are independent, this means that the probability that
one event occurs in no way affects the probability of the other event occurring. Because
these numbers are equal, the probability that one event occurs in no way affects
the probability of the other event occurring, and the events are therefore independent.
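A minimal sketch of the independence check above, in Python. The two-way table counts are made up for illustration, and W and N simply stand in for the two events in the blanks.

```python
# Hypothetical two-way table of counts (not from any real data set).
# First label: whether event N occurred; second label: whether event W occurred.
counts = {
    ("N", "W"): 30, ("N", "not W"): 70,
    ("not N", "W"): 60, ("not N", "not W"): 140,
}

total = sum(counts.values())
p_w = (counts[("N", "W")] + counts[("not N", "W")]) / total    # P(W)
n_total = counts[("N", "W")] + counts[("N", "not W")]
p_w_given_n = counts[("N", "W")] / n_total                     # P(W | N)

# The events are independent when P(W) equals P(W | N) (allowing for rounding).
independent = abs(p_w - p_w_given_n) < 1e-9
print(p_w, p_w_given_n, independent)
```

With these counts both probabilities are 0.3, so the sentence to write is the "are independent" one.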
Convincing Evidence
Zero is within the 90 percent confidence interval of plausible values for the difference in
population means. Therefore this confidence interval does not provide evidence that there is a
difference in mean sale price between the two cities.
A significance test is used to determine whether the difference between the value assumed in
the null hypothesis and the value observed in the experiment is big enough to rule out
chance alone as a plausible explanation for the result.
Because the P-value exceeds our default α = 0.05 significance level, we can’t conclude
that the company’s new AAA batteries last longer than 30 hours, on average.
Ho: mu = 50
Ha: mu > 50
If p value is higher than alpha, we do not have convincing evidence that the alternative
hypothesis is true.
If the ___ percent confidence interval is strictly negative, then there is convincing evidence the
second proportion is larger.
If the ___ percent confidence interval is strictly positive, then there is convincing evidence the
first proportion is larger.
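A rough sketch of how the sign of a two-proportion z-interval leads to these sentences. The counts are hypothetical, the function name is made up for this example, and the large counts condition is assumed to hold.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf_level=0.95):
    """Confidence interval for p1 - p2 (large counts condition assumed to hold)."""
    p1, p2 = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf((1 + conf_level) / 2)   # critical value
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)    # standard error
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

# Hypothetical counts: 120 successes of 200 in group 1, 90 of 200 in group 2.
low, high = two_prop_z_interval(120, 200, 90, 200)
if low > 0:
    verdict = "strictly positive: evidence the first proportion is larger"
elif high < 0:
    verdict = "strictly negative: evidence the second proportion is larger"
else:
    verdict = "contains 0: no convincing evidence of a difference"
print(round(low, 3), round(high, 3), verdict)
```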
Random is Necessary
Without random assignment, it is possible that the difference in task completion times could
be due to a confounding variable not related to the cheering students. For example, a volunteer
could improve their time through experience with the task. Randomization “evens out” the
possible effects of potentially confounding variables.
The purpose of random assignment is to create two groups of subjects that are roughly
equivalent with respect to potential confounding variables. This will allow for a cause-and-effect
conclusion if the difference between the two groups is statistically significant.
Observational or Experiment
Observational study - no treatments are assigned; researchers have no influence on who does what
Experiment - researchers assign treatments to subjects
Control: control group OR blind
Random: participants are randomly assigned, which helps balance out confounding variables
Replication - perform the experiment on many subjects to reduce chance variation in
results (benefit is reduced variability, use large sample size) “Because the sample size is large
(n=200), replication applies”
Confounding Variable
Age is a potential confounding variable, because it could be related to both napping frequency
and CVD event status for the adults in the study. For example, older adults probably tend to nap
more frequently and also tend to experience more CVD events than younger adults. If both of
these are true, then the result of the study (that adults who nap more frequently tend to
experience more CVD events than adults who nap less frequently) would have been found even
if napping frequency has no effect on the likelihood of having a CVD event.
Possible confounding variable: age, gender, smoking, alcohol, activity level, diet
Adding and Multiplying Mean and Standard Deviation
(Transforming)
If you add the same constant to all of the data:
Shape stays the same
Center add c
Variability stays the same
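A quick check of the add-a-constant rule in Python. The data values and the constant c are made up for illustration.

```python
from statistics import mean, median, pstdev

data = [2, 4, 4, 5, 7, 9]      # hypothetical data values
c = 10                         # hypothetical constant added to every value
shifted = [x + c for x in data]

print(mean(data), mean(shifted))       # center (mean) moves up by c
print(median(data), median(shifted))   # center (median) also moves up by c
print(pstdev(data), pstdev(shifted))   # standard deviation is unchanged
print(max(data) - min(data), max(shifted) - min(shifted))  # range is unchanged
```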
Distributions
Geometric - The geometric distribution is a special case of the negative binomial
distribution. It deals with the number of trials required for a single success. Thus, the
geometric distribution is a negative binomial distribution where the number of
successes (r) is equal to 1.
Conditions for Geometric Distributions: (Somehow this is BIFS)
● Each observation falls into one of two categories: Success or Failure (or
whatever you wish to call them).
● The probability of success is the same for each observation.
● The observations are independent.
● The variable of interest is the number of trials required to obtain the first
success.
Plots
Bar Graph
A histogram often looks similar to a bar graph, but they are different
because of the type of data they display. Bar graphs show
the frequency of categorical data. A categorical variable is one that has
two or more categories, such as gender or hair color. Histograms, by
contrast, are used for quantitative data: the values are grouped into
intervals (bins) of equal width, and the bars touch each other.
A stem and leaf plot breaks each value of a quantitative data set into
two pieces: a stem, typically for the highest place value, and a leaf for
the other place values. It provides a way to list all data values in a
compact form.
Dot Plot
A dot plot is a hybrid between a histogram and a stem and leaf plot.
Each quantitative data value becomes a dot or point that is placed
above the appropriate class values.
Scatterplot
Stacked bar graph - a graph that is used to break down and compare parts of a whole.
Each bar in the chart represents a whole, and segments in the bar represent different
parts or categories of that whole.
Probability
Mutually exclusive events only: P(A or B) = P(A U B) = P(A) + P(B)
General Addition Rule (works even if they overlap): P(A or B) = P(A) + P(B) - P(A and B)
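A small worked example of the addition rule for overlapping events, using a standard 52-card deck. The choice of events is just for illustration.

```python
from fractions import Fraction

# Standard 52-card deck: A = card is a heart, B = card is a face card (J, Q, K).
p_a = Fraction(13, 52)
p_b = Fraction(12, 52)
p_a_and_b = Fraction(3, 52)       # jack, queen, king of hearts

p_a_or_b = p_a + p_b - p_a_and_b  # subtract the overlap so it is not counted twice
print(p_a_or_b)                   # 11/26
```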
Short term - Unpredictable
Long term -Predictable
Law of large numbers - If we do something many many times, the proportion of desired
outcomes will approach its probability.
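A simulation sketch of the law of large numbers, assuming a fair coin (p = 0.5). The seed is fixed only so the run can be repeated.

```python
import random

rng = random.Random(1)            # fixed seed so the run is repeatable
p_heads = 0.5                     # fair coin (assumed for illustration)

heads = 0
checkpoints = {}
for flips in range(1, 100_001):
    heads += rng.random() < p_heads          # True counts as 1, False as 0
    if flips in (10, 100, 1_000, 10_000, 100_000):
        checkpoints[flips] = heads / flips   # proportion of heads so far

for flips, proportion in checkpoints.items():
    print(flips, round(proportion, 4))
```

The early proportions can be far from 0.5 (short term, unpredictable), while the later ones settle close to it (long term, predictable).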
Mutually Exclusive:
P(A and B)=0 since they can’t both happen
P(A or B) = P(A)+P(B)-0
Interpret Probability: If someone does (chance process) many, many times, (event) will
happen about (probability as a percent) of the time.
Independent - the outcome of an event has no effect on the outcome of another event
Check independence: P(A) = P(A|B) = P(A|Bc)
If this is true, A and B are independent
Tree Diagram
Confidence Interval
We are ___ percent confident that the interval from (low) to (high) captures the difference in the
proportions of (context).
Confidence Level
In _____% of all possible samples, the interval computed from the sample data will capture the
true (parameter in context)
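A simulation sketch of what the confidence level means: build many 95% intervals from repeated samples and count how often they capture the true proportion. The true proportion, sample size and number of intervals are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(11)                  # fixed seed for repeatability
true_p = 0.30                            # hypothetical population proportion
n = 400                                  # sample size
z_star = NormalDist().inv_cdf(0.975)     # critical value for 95% confidence

num_intervals = 2000
captured = 0
for _ in range(num_intervals):
    successes = sum(rng.random() < true_p for _ in range(n))
    p_hat = successes / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - margin <= true_p <= p_hat + margin:
        captured += 1

print(captured / num_intervals)          # should land near 0.95
```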
Quantitative vs Qualitative Data
Quantitative Data - Quantitative data are measures of values or counts and are
expressed as numbers.
Example: age, height, (A Number that can be EXACT: 5.6768)
Qualitative Data - data that approximates and characterizes. Qualitative data can be
observed and recorded. This data type is non-numerical in nature. This type of data is
collected through methods of observations, one-to-one interview, conducting focus
groups and similar methods.
Example - types of wood, types of food, numbers that cannot be EXACT (example:
categories… Kids age 5-10, Kids age 11-15)
Error
Margin of error - The margin of error is a statistic expressing the amount of random
sampling error in the results of a survey. The larger the margin of error, the less
confidence one should have that a poll result would reflect the result of a survey of the
entire population.
Formula: Margin of error = Critical value x Standard error of the sample
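A short sketch of the margin of error formula for one sample proportion. The sample proportion and sample size are made-up values.

```python
from math import sqrt
from statistics import NormalDist

p_hat = 0.52          # hypothetical sample proportion
n = 1000              # hypothetical sample size
conf_level = 0.95

z_star = NormalDist().inv_cdf((1 + conf_level) / 2)   # critical value, about 1.96
standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z_star * standard_error

print(round(margin_of_error, 4))
print(round(p_hat - margin_of_error, 4), round(p_hat + margin_of_error, 4))
```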
Type 1 error - the rejection of a true null hypothesis (rejecting something that is true)
Null is clean. The water is clean, but we say it’s not.
Type 2 error - the non-rejection of a false null hypothesis. (failing to reject something that is false)
Null is clean. The water isn’t clean, but we say it is.
Increasing power: increase sample size, increase alpha level, increase difference
between the null and alternative hypothesis values
4 step plan
TO FIND T STAR - 2nd, VARS, invT, enter area = (1 + CL)/2 as a decimal (0.975 for 95%), enter df = n - 1
Conditions
Normal/Large Counts Condition(only for proportion): so the sampling distribution of the sample
proportions will be approximately Normal and we can use z to find a P-value.
CLT (only for mean) - when the sample size is sufficiently large and the population has a
finite variance, the sampling distribution of the sample mean is approximately Normal,
whatever the shape of the population. Its mean equals the mean of the population.
- Shows that the sampling distribution of the sample mean (not the data) is approximately Normal.
- This allows us to use a t-distribution to estimate the p-value of the test.
How to check? Sample size n ≥ 30
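A simulation sketch of the CLT, assuming a right-skewed (exponential) population with mean 1 and standard deviation 1. The population, sample size and seed are choices made for this example only.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)                      # fixed seed for repeatability
n = 40                                       # sample size (at least 30)
num_samples = 5000

# Right-skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

print(round(mean(sample_means), 3))          # close to the population mean, 1
print(round(stdev(sample_means), 3))         # close to 1 / sqrt(40), about 0.158
```

A histogram of sample_means would look roughly bell shaped even though the population itself is strongly skewed.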
Sampling Method
Matched pairs - can be used when the experiment has only two treatment conditions and
subjects can be grouped into pairs, based on some blocking variable. Then, within each
pair, subjects are randomly assigned to different treatments.
Flip a coin - 2 outcomes (label each person heads or tails, etc)
Block Design - With a randomized block design, the experimenter divides subjects into
subgroups called blocks, such that the variability within blocks is less than the
variability between blocks. Then, subjects within each block are randomly assigned to
treatment conditions.
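A minimal sketch of random assignment within blocks. The subject names, the blocking variable (age group) and the two treatment labels are all hypothetical.

```python
import random

rng = random.Random(7)   # fixed seed so the assignment can be reproduced

# Hypothetical subjects, blocked by a hypothetical variable (age group).
blocks = {
    "under 30": ["Ana", "Ben", "Cy", "Dee"],
    "30 and over": ["Eli", "Fay", "Gus", "Hal"],
}

assignment = {}
for block, subjects in blocks.items():
    shuffled = subjects[:]           # copy so the original list is untouched
    rng.shuffle(shuffled)            # random order within this block only
    half = len(shuffled) // 2
    for name in shuffled[:half]:
        assignment[name] = "treatment"
    for name in shuffled[half:]:
        assignment[name] = "control"

print(assignment)
```

Because the shuffle happens separately inside each block, every block ends up split evenly between the two conditions.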
Simulation - using artificially generated data in order to test out a hypothesis or statistical
method
Double Blind - participant AND experimenter don’t know who gets what (placebo or not)
Blind - Participant does not know what thing they are getting
Unbiased:
Simple Random: the names of 25 employees being chosen out of a hat from a
company of 250 employees
Systematic: assume that in a population of 10,000 people, a statistician selects every
100th person
Stratified: one might divide a population of adults into subgroups by age, like 18–29,
30–39, 40–49, 50–59, and 60 and above. Take a simple random sample from each stratum.
Cluster: divide the entire population (population of Spain) into different clusters (cities),
based on physical closeness. Then randomly select some clusters and sample everyone in them.
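A rough sketch of the four random sampling methods above in Python. The population of 250 employee IDs, the two strata and the ten clusters are all made up for illustration.

```python
import random

rng = random.Random(3)                        # fixed seed for repeatability
population = list(range(1, 251))              # 250 hypothetical employee IDs

# Simple random sample: every group of 25 is equally likely.
srs = rng.sample(population, 25)

# Systematic sample: random start, then every 10th ID.
start = rng.randrange(10)
systematic = population[start::10]

# Stratified sample: an SRS of 5 from each (hypothetical) stratum.
strata = {
    "day shift": population[:150],
    "night shift": population[150:],
}
stratified = [emp for group in strata.values() for emp in rng.sample(group, 5)]

# Cluster sample: randomly pick whole clusters and take everyone in them.
clusters = [population[i:i + 25] for i in range(0, 250, 25)]   # 10 clusters
cluster_sample = [emp for cluster in rng.sample(clusters, 2) for emp in cluster]

print(len(srs), len(systematic), len(stratified), len(cluster_sample))
```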
Biased:
Convenience: Take whoever comes first
Quota: researchers look for a specific characteristic in their respondents, and then take
a tailored sample that is in proportion to a population of interest.
Judgement: based on expert opinion: a researcher may decide to draw the entire
sample from one "representative" city, even though the population includes all cities.
Snowball: people who have many friends are more likely to be recruited into the sample.
When virtual social networks are used, then this technique is called virtual snowball
sampling.
Types of Bias:
Non sampling bias: poor wording or non response
http://www.mathonmonday.com/stat%20formulae.pdf
portnet.org/cms/lib/NY01001023/Centricity/Domain/195/Stats_Calc_Sheet.pdf
Hypothesis test decisions:
Quantitative variable(s)
Use 1-sample t-test if comparing known population mean to unknown population
mean.
Use 2-sample t-test if comparing two unknown population means if the two
samples are independent.
Use 1-sample t-test for matched pairs if you want to know if there’s a difference
in two means given that the two samples are NOT independent---matched
on some variable.
Use 1-sample t-interval to estimate unknown population mean.
Use 2-sample t-interval to estimate the difference between two unknown
population means.
Categorical variable(s)
Use 1-proportion z-test if comparing a known population proportion to an
unknown population proportion.
Use 2-proportion z-test if comparing two unknown population proportions if the
samples are independent.
Use 1-proportion z-interval to estimate an unknown population proportion.
Use 2-proportion z-interval to estimate the difference between two unknown
population proportions.
Calculator Stuff
Density curve - Has an area of 1
Normal Distribution
Normal density curve - mean=median, bell shaped
z = (statistic - parameter) / sd
To find the probability that the variable will fall into a certain interval that you supply
normalcdf(lower, upper, mu, sd)
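For checking calculator answers, here is a Python sketch of the same Normal area calculation. The helper name normalcdf is written for this example to mirror the calculator command, and the heights example is hypothetical.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sd=1.0):
    """Area under the Normal(mu, sd) curve between lower and upper."""
    dist = NormalDist(mu, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Hypothetical example: heights with mean 64 and standard deviation 2.5.
z = (68 - 64) / 2.5                          # z = (value - mean) / sd = 1.6
print(z)
print(round(normalcdf(60, 68, 64, 2.5), 4))  # area between z = -1.6 and z = 1.6
```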
Binomial Distribution
To find probability that there will be X successes in n trials if there is a probability p of
success for each trial
binompdf(#trials, probability of success, how many successes desired/variable of interest)
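A Python sketch of what binompdf computes, written as a small helper with the same argument order. The helper is for illustration; the example values are hypothetical.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical example: exactly 3 successes in 10 trials with p = 0.5.
print(binompdf(10, 0.5, 3))    # 120 / 1024 = 0.1171875
```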
Geometric Distribution
To find the probability of success on the specified trial #
geometpdf(prob of success, specified trial #)
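A Python sketch of what geometpdf computes: failures on the first x - 1 trials, then a success on trial x. The helper and the die example are for illustration only.

```python
def geometpdf(p, x):
    """P(first success occurs on trial x) when each trial succeeds with probability p."""
    return (1 - p) ** (x - 1) * p

# Hypothetical example: first 6 on the 3rd roll of a fair die.
print(round(geometpdf(1 / 6, 3), 4))   # (5/6)^2 * (1/6) = 25/216, about 0.1157
```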
T Distribution
Probability Density Function
Tpdf (x,df)
Distribution Probability
Tcdf (lower,upper,df)
https://mathbits.com/MathBits/TISection/Statistics2/Quick%20Reference%20Sheet%20AP%20Statistics.pdf
Sentences
ONE VARIABLE RELATIONSHIPS
Describing distributions
SOCV
-Shape (left skew, right skew, approx. symmetric),
-Outlier - either just look at the graph
or use outlier formula: Outlier > Q3 + 1.5(IQR) or Outlier < Q1 - 1.5(IQR)
-Center (mean, median),
-Variability (IQR, std dev, range)
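A sketch of the 1.5(IQR) outlier rule in Python. The data set is made up, and the quartiles are found as medians of the lower and upper halves, which is one common convention (other software may use a slightly different quartile method).

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value excluded when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return median(lower), median(upper)

data = [3, 5, 6, 7, 8, 9, 10, 12, 30]     # hypothetical data set
q1, q3 = quartiles(data)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(q1, q3, iqr, outliers)              # 5.5 11.0 5.5 [30]
```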
Comparing Distributions
Use COMPARATIVE LANGUAGE for all of SOCV
Mean
A typical value of (variable context) is about (mean) (units).
Standard Deviation
The (variable in context) typically varies by (std dev) (units) from the mean of (mean).
r² (coefficient of determination)
(percent value of r²)% of the variability in (response variable) can be explained by the linear
association between (explanatory variable) and (response variable).
r (correlation coefficient)
If the true relationship between (explanatory variable) and (response variable) is linear, then
the linear relation is a (weak/moderate/strong) (positive/negative) relation.
Y-intercept
When the (explanatory variable) is zero (units), the linear model predicts the (response
variable) will be (y-intercept) (units).
Slope
The model predicts that (response variable) will (increase/decrease) by approximately (value
of slope) when the (explanatory variable) increases by 1 (units) on average.
Residual
Note: when residual > 0, the response variable is greater than predicted value (& vice-versa)
The actual (context) was about (residual value) (units) (greater/less) than predicted using the
least squares regression line where x = (x axis explanatory variable)
OR
When the (explanatory variable) is (x-coordinate), the actual value of (response variable) is
(residual value) (units) (greater/less) than the predicted value.
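A worked sketch tying together slope, y-intercept, r² and residual for a least squares regression line. The data (hours studied and test score) are hypothetical.

```python
from statistics import mean

# Hypothetical data: x = hours studied, y = test score.
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 77]

x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

slope = sxy / sxx                      # b in y-hat = a + b x
intercept = y_bar - slope * x_bar      # a
r = sxy / (sxx * syy) ** 0.5           # correlation coefficient
r_squared = r ** 2

predicted = intercept + slope * x[2]   # prediction at x = 3
residual = y[2] - predicted            # actual - predicted
print(slope, intercept, round(r_squared, 3), residual)
```

Here the residual at x = 3 is -3, so the sentence would read: the actual score was 3 points less than predicted by the least squares regression line.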
Outlier
Based on part b an eruption with a 51 minute wait time should be classified as an outlier
because it is more than 2 standard deviations from the mean.
However the point is not an outlier on the scatterplot because it is less than 1 standard
deviation from the least squares regression line (5<6.63).
Confidence Intervals
We are (CL)% confident that the interval from (lower bound) to (upper bound) captures the
true (parameter).
P-value
FOR ONE: If (H0 in context) is true, there is a (p-value) probability of getting a sample of size (n)
with a (statistic) (greater than/less than/as extreme as) (statistic value) by chance alone.
^^^^^^^^
If your alternative hypothesis is p1 - p2 < 0: difference in proportions is less than the observed difference in phats.
If your alternative hypothesis is p1 - p2 > 0: difference in proportions is greater than the observed difference in phats.
If your alternative hypothesis is p1 - p2 ≠ 0: difference in proportions is greater than |observed difference| or
less than -|observed difference|.
● P Value - Conclusion
Notes: if p value is less than alpha, reject H0; the data support Ha
if p value is greater than alpha, fail to reject H0; the data do not support Ha.
Since the p-value is (greater than/less than) alpha, we (fail to reject/reject) H0. The data (do not/do)
provide convincing evidence that (Ha in context).
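A sketch of the p-value decision for a one-proportion z-test. The hypotheses, sample size and count of successes are hypothetical, and the conditions for the test are assumed to be met.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical test: H0: p = 0.50 versus Ha: p > 0.50.
p0 = 0.50
n = 200                     # hypothetical sample size
p_hat = 112 / n             # hypothetical sample proportion (112 successes)
alpha = 0.05

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # standardized test statistic
p_value = 1 - NormalDist().cdf(z)              # upper tail, because Ha is p > p0

if p_value < alpha:
    conclusion = "reject H0: convincing evidence for Ha"
else:
    conclusion = "fail to reject H0: no convincing evidence for Ha"
print(round(z, 2), round(p_value, 4), conclusion)
```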
Because the 95% confidence interval is strictly positive, zero is not a plausible value for
mu1 - mu2, so the interval provides convincing evidence of a difference (it does not support mu1 - mu2 = 0).
Power
The power in this situation is the probability of correctly rejecting H0 and finding convincing evidence
that (alternative hypothesis) when the true value of the (parameter in context) is (alternative value).
Z-Score
This (context) is (|z score|) standard deviations (greater/less than) the mean (context).
Example: This wait time was 2.98 standard deviations less than the mean wait time.