AP Statistics Study Guide / Cheat Sheet

The key takeaways are definitions of statistical concepts like distributions, measures of center, spread and shape of distributions, and independence of events. Different types of plots and analyses for distributions and relationships are also discussed.

Measures used to describe distributions include measures of center like mean and median, measures of spread like range, interquartile range (IQR) and standard deviation, and descriptions of shape like symmetrical, skewed left or right.

Independence of events means the probability of one event does not affect the probability of the other, while dependence means the probability of one event does impact the probability of the other. Independence is tested using a comparison of joint and marginal probabilities.

Definitions:

Disjoint/Mutually exclusive - events that cannot occur together (both male and female at the
same time)
Residual = Actual - Predicted

Compare distributions
Both distributions of sale prices are centered around 400,000 and are roughly symmetric
(approximately normal is OK here). The distribution of sale prices in Thunder City (range of
about 400,000) is more variable than that of Primeville (range of about 300,000).

The distribution of number of barbell curls is slightly skewed to the right for both groups, with
more skew in the distribution for those that weren’t encouraged. The median number of curls
and the variability in number of curls is greater for the group that received encouragement.
Neither distribution has any outliers.

Shape​ - approximately symmetrical, skewed left, skewed right


Outlier​ - either just look at the graph
or use outlier formula: Outlier > Q3 + 1.5(IQR) or Outlier < Q1 - 1.5(IQR)
Center​ - median or mean
Variability​ - Standard Deviation (use with mean) and IQR (use with median)
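
The 1.5(IQR) outlier rule above can be sketched in Python. This is a minimal sketch, assuming quartiles are found as the medians of the lower and upper halves of the sorted data (other quartile conventions give slightly different fences); the function names and the sample data are made up for illustration.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # Skip the middle value when the number of data values is odd.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(upper)

def find_outliers(data):
    """Return values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # made-up data
print(quartiles(sample))
print(find_outliers(sample))
```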

The standard deviation is an especially useful measure of variability when the distribution is
normal or approximately normal (see Chapter on Normal Distributions) because the proportion
of the distribution within a given number of standard deviations from the mean can be calculated

Independence
Because P(home price > 500,000) ≠ P(home price > 500,000 | Primeville), the events “sold in
Primeville” and “sale price over $500,000” are not independent

Because P(W) ≠ P(W | N), the events ________ and _________ are not independent.
Because P(W) = P( W | N), the events _________ and _______ are independent.

Explain:

NOT INDEPENDENT: Two events are independent when the probability that one event
occurs in no way affects the probability of the other event occurring.
Because these numbers are not equal, the probability that one event occurs does affect
the probability of the other event occurring, so the events are not independent.

INDEPENDENT: Two events are independent when the probability that one event
occurs in no way affects the probability of the other event occurring. Because
these numbers are equal, the probability that one event occurs in no way affects
the probability of the other event occurring, so the events are independent.

Convincing Evidence
Zero is within the 90 percent confidence interval of plausible values for the difference in
population means. Therefore this confidence interval does not provide evidence that there is a
difference in mean sale price between the two cities

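If zero is in the interval, the difference could plausibly go either way (or be zero), so the
interval does not give convincing evidence of a difference.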


If interval is all negative, the second proportion is larger.
If interval is all positive, the first proportion is larger.

A significance test is used to determine whether the difference between the value assumed in
the null hypothesis and the value observed in the study is too large to be plausibly
explained by chance alone.

Because the P-value exceeds our default α = 0.05 significance level, we can’t conclude
that the company’s new AAA batteries last longer than 30 hours, on average.

Ho: mu = 50
Ha: mu > 50

If​ p value​ is higher than ​alpha​, we do not have convincing evidence that the alternative
hypothesis is true.
If the ___ ​percent ​confidence interval is strictly negative, then there is convincing evidence ​the
second proportion ​is larger.

If the ___ ​percent ​confidence interval is strictly positive, then there is convincing evidence ​the
first proportion ​is larger.

If the ___ ​percent ​confidence interval contains zero, then it is inconclusive.
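
The three cases above can be sketched in Python. This is a minimal sketch, assuming a two-proportion z-interval with the usual unpooled standard error; the function names and the counts (60 of 100 versus 45 of 100) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_interval(x1, n1, x2, n2, conf=0.95):
    """Confidence interval for p1 - p2 from success counts x1, x2."""
    p1, p2 = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf((1 + conf) / 2)  # critical value
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

def conclusion(low, high):
    """Translate the interval into the three cases listed above."""
    if low > 0:
        return "strictly positive: evidence the first proportion is larger"
    if high < 0:
        return "strictly negative: evidence the second proportion is larger"
    return "contains zero: inconclusive"

low, high = two_prop_interval(60, 100, 45, 100)  # hypothetical counts
print(round(low, 3), round(high, 3), conclusion(low, high))
```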


Therefore, the confidence interval does not provide evidence that there is a difference in mean
context.

Random is Necessary
Without random assignment, it is possible that the difference in task completion times could
be due to a confounding variable not related to the cheering students. For example, a volunteer
could improve their time through experience with the task. Randomization “evens out” the
possible effects of potentially confounding variables.
The purpose of random assignment is to create two groups of subjects that are roughly
equivalent with respect to potential confounding variables. This will allow for a cause-and-effect
conclusion if the difference between the two groups is statistically significant.
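
Random assignment can be sketched in Python. This is a minimal sketch, assuming two equal-sized groups; the volunteer labels, the function name, and the seed are made up for illustration.

```python
import random

def random_assign(subjects, seed=None):
    """Shuffle the subjects and split them into two groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

volunteers = [f"volunteer_{i}" for i in range(1, 21)]  # hypothetical subjects
treatment, control = random_assign(volunteers, seed=1)
print(len(treatment), len(control))
```

Because chance alone decides who lands in each group, variables such as experience with the task should be spread roughly evenly between the groups.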

Observational or Experiment
Observational - no treatments are assigned; the researchers have no influence on who does what

Experiment -
Control: control group OR blinding
Random: participants are randomly assigned, which helps balance out confounding variables
Replication: perform the experiment on many subjects to reduce chance variation in
results (benefit is reduced variability; use a large sample size). “Because the sample size is large
(n=200), replication applies.”

This is an experimental study because it satisfies all conditions of an experiment. The
participants are blind and randomly assigned the advertisements, which helps nullify any
confounding variables. There is also a large sample size (n=200), meaning there is a lower
chance of variation and replication applies.

Confounding Variable
Age is a potential confounding variable, because it could be related to both napping frequency
and CVD event status for the adults in the study. For example, older adults probably tend to nap
more frequently and also tend to experience more CVD events than younger adults. If both of
these are true, then the result of the study (that adults who nap more frequently tend to
experience more CVD events than adults who nap less frequently) would have been found even
if napping frequency has no effect on the likelihood of having a CVD event.

Possible confounding variable: age, gender, smoking, alcohol, activity level, diet
Adding and Multiplying Mean and Standard Deviation
(Transforming)
If you add the same constant c to all of the data:
Shape: stays the same
Center: add c
Variability: stays the same

If you multiply all of the data by the same constant c:
Shape: stays the same
Center: multiply by c
Variability: multiply by |c|
New SD = |c|(SD)
New variance = (c*SD)^2 = c^2*SD^2
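
These rules can be checked numerically. This is a minimal sketch on a small made-up data set, using the population standard deviation for simplicity.

```python
from statistics import mean, pstdev

data = [2, 4, 6, 8]  # made-up data
c = 3

shifted = [x + c for x in data]  # add the same constant to every value
scaled = [x * c for x in data]   # multiply every value by the same constant

print("original:", mean(data), pstdev(data))
print("add c:   ", mean(shifted), pstdev(shifted))  # center moves, spread unchanged
print("times c: ", mean(scaled), pstdev(scaled))    # center and spread both scale
```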

Distributions

Geometric - The geometric distribution is a special case of the negative binomial
distribution. It deals with the number of trials required for a single success. Thus, the
geometric distribution is a negative binomial distribution where the number of
successes (r) is equal to 1.

Conditions for Geometric Distributions: (Somehow this is BIFS) 
● Each observation falls into one of two categories: Success or Failure (or 
whatever you wish to call them). 
● The probability of success is the same for each observation. 
● The observations are independent. 
● The variable of interest is the number of trials required to obtain the first 
success. 
 

Binomial - The binomial distribution describes the number of successes in a fixed number of
independent trials when each trial has the same probability of success. It gives the
probability of observing a specified number of successful outcomes in a specified number
of trials.

Conditions for Binomial 
● A fixed number of trials.  
● Each trial is independent of the others.  
● There are only two outcomes.  
● The probability of each outcome remains constant from trial to trial. 
 
Normal - symmetric distribution where most of the observations cluster around the
central peak and the probabilities for values further away from the mean taper off
equally in both directions
 
Features:
- The mean, mode and median are all equal. 
- The curve is symmetric at the center (i.e. around the mean, μ). 
- Exactly half of the values are to the left of center and exactly half the values are 
to the right. 
- The total area under the curve is 1.

Plots

Bar Graph

The bars are arranged in order of frequency, so more important
categories are emphasized. By looking at all the bars, it is easy to tell
at a glance which categories in a set of data dominate the others. Bar
graphs can be either single, stacked, or grouped.
Pie Chart

This kind of graph is helpful when graphing qualitative data, where
the information describes a trait or attribute and is not numerical.
Each slice of pie represents a different category, and each trait
corresponds to a different slice of the pie; some slices are usually
noticeably larger than others.
Histogram

This type of graph is used with quantitative data. Ranges of values,
called classes, are listed at the bottom, and the classes with greater
frequencies have taller bars.

A histogram often looks similar to a bar graph, but they are different
because of the type of data they display. Bar graphs show
the frequency of categorical data. A categorical variable is one that has
two or more categories, such as gender or hair color. Histograms, by
contrast, are used for quantitative data, such as heights or test scores,
grouped into classes of equal width.

Stem and Leaf Plot

A ​stem and leaf plot​ breaks each value of a quantitative data set into
two pieces: a stem, typically for the highest place value, and a leaf for
the other place values. It provides a way to list all data values in a
compact form.
Dot Plot

A ​dot plot​ is a hybrid between a histogram and a ​stem and leaf plot​.
Each quantitative data value becomes a dot or point that is placed
above the appropriate class values.

Scatterplot

A scatterplot displays data that is paired by using a horizontal axis
(the x-axis), and a vertical axis (the y-axis). The statistical tools of
correlation and regression are then used to show trends on the
scatterplot.
Box Plot

A boxplot is a method for graphically depicting groups of numerical
data through their quartiles. Box plots may also have lines (whiskers) extending
from the boxes indicating variability outside the upper and lower
quartiles.
Segmented Bar Graph

A segmented (stacked) bar graph is a graph that is used to break down and compare parts of a whole.
Each bar in the chart represents a whole, and segments in the bar represent different
parts or categories of that whole.
Probability
General Addition Rule: P(A or B) = P(A U B) = P(A) + P(B) - P(A and B)
If A and B do not overlap (mutually exclusive), P(A and B) = 0, so P(A or B) = P(A) + P(B)
Short term - Unpredictable
Long term -Predictable
Law of large numbers - If we do something many many times, the proportion of desired
outcomes will approach its probability.
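
The law of large numbers can be illustrated with a simulation. This is a minimal sketch, assuming a fair coin (probability of heads 0.5); the function name, checkpoints, and seed are made up for illustration.

```python
import random

def running_proportion(n_flips, seed=0):
    """Flip a simulated fair coin and record the proportion of heads so far."""
    rng = random.Random(seed)
    heads = 0
    checkpoints = {}
    for i in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1
        if i in (10, 100, 1000, 10000, 100000):
            checkpoints[i] = heads / i
    return checkpoints

props = running_proportion(100000)
for n, prop in props.items():
    print(n, round(prop, 4))
```

The early proportions can be far from 0.5 (short term: unpredictable), while the later ones settle close to it (long term: predictable).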

General Multiplication Rule:


P(A∩B) = P(A) x P(B|A)
If A and B are independent, P(B) = P(B|A) so
Independent: P(A∩B) = P(A) x P(B)

Two way table Venn Diagram

Mutually Exclusive:
P(A and B)=0 since they can’t both happen
P(A or B) = P(A)+P(B)-0

Interpret Probability: If ​someone​ does ​something ​many many times, ​probability ​of the time
this will happen.

Conditional Probability: Probability of A given B = P(A|B)=P(A ∩ B)/P(B)

Independent - the outcome of an event has no effect on the outcome of another event
Check independence: P(A) = P(A|B) = P(A|B^c)
If this is true, A and B are independent
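
The independence check can be sketched from two-way table counts. This is a minimal sketch; the function name and the counts (200 homes, 50 priced over $500,000, 80 sold in Primeville) are made up for illustration and are not taken from the original problem.

```python
def check_independence(n_a_and_b, n_a, n_b, n_total, tol=1e-9):
    """Compare P(A) with P(A | B) using counts from a two-way table."""
    p_a = n_a / n_total            # marginal probability of A
    p_a_given_b = n_a_and_b / n_b  # conditional probability of A given B
    return p_a, p_a_given_b, abs(p_a - p_a_given_b) <= tol

# Hypothetical table: A = "price over $500,000", B = "sold in Primeville".
print(check_independence(n_a_and_b=10, n_a=50, n_b=80, n_total=200))
print(check_independence(n_a_and_b=20, n_a=50, n_b=80, n_total=200))
```

In the first call P(A) = 0.25 but P(A | B) = 0.125, so the events are not independent; in the second call both equal 0.25, so they are.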

Tree Diagram

Confidence Interval

We are ​percent ​confident that the interval from ​low ​to ​high ​captures the difference in the
proportion of ​context.

Confidence Level
In _____% of all possible samples, the interval computed from the sample data will capture the
true (parameter in context)
Quantitative vs Qualitative Data

Quantitative Data​ - ​Quantitative data are measures of values or counts and are 
expressed as numbers. 
 
Example: age, height, (A Number that can be EXACT: 5.6768) 

Qualitative Data ​- ​ data that approximates and characterizes. Qualitative data can be 
observed and recorded. This data type is non-numerical in nature. This type of data is 
collected through methods of observations, one-to-one interview, conducting focus 
groups and similar methods. 
 
Example - types of wood, types of food, numbers that cannot be EXACT (example: 
categories… Kids age 5-10, Kids age 11-15) 
 

Error
Margin of error ​- The margin of error is a statistic expressing the amount of random 
sampling error in the results of a survey. The larger the margin of error, the less 
confidence one should have that a poll result would reflect the result of a survey of the 
entire population. 
 
Formula: Margin of error = Critical value x Standard error of the statistic
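
The formula can be sketched for a one-proportion z-interval. This is a minimal sketch, assuming a sample proportion and the usual standard error sqrt(p-hat(1 - p-hat)/n); the function name and the numbers (p-hat = 0.5, n = 100) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error_proportion(p_hat, n, conf=0.95):
    """Margin of error = critical value x standard error."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)  # critical value z*
    se = sqrt(p_hat * (1 - p_hat) / n)             # standard error of p-hat
    return z_star * se

moe = margin_of_error_proportion(0.5, 100)
print(round(moe, 3))
```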
 
Type 1 error​ - ​the rejection of a true null hypothesis (rejecting something that is true) 
 
Null is clean. The water is clean, but we say it’s not. 
 
Type 2 error - the non-rejection of a false null hypothesis. (failing to reject something that is false)
 
Null is clean. The water isn’t clean, but we say it is. 
 
 
Increasing power: increase sample size, increase alpha level, increase difference 
between the null and alternative hypothesis values 

 
4 step plan
TO FIND T STAR - 2nd, VARS, invT, enter area to the left = (1 + CL)/2 as a decimal (0.975 for 95%), enter df = n - 1
Conditions

Random Condition: so we can generalize to the population.

10% Condition: so sampling without replacement is OK.

Normal/Large Counts Condition(only for proportion): so the sampling distribution of the sample
proportions will be approximately Normal and we can use z to find a P-value.

CLT (only for mean) - given a sufficiently large sample size from a population with a
finite level of variance, the sampling distribution of the sample mean will be
approximately Normal and centered at the mean of the population, whatever the shape
of the population

- Shows that the sampling distribution of the sample mean (not the data itself) is approximately Normal.
- This allows us to use a t-distribution to estimate the p-value of the test.

How to check? Sample size ≥ 30
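
The central limit theorem can be illustrated with a simulation. This is a minimal sketch, assuming a strongly right-skewed population (exponential with mean 1 and standard deviation 1); the sample size of 40, the 2000 repetitions, and the seed are made up for illustration.

```python
import random
from statistics import fmean, stdev

def sample_means(n, reps, seed=0):
    """Draw `reps` samples of size n and return the mean of each sample."""
    rng = random.Random(seed)
    return [fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

means = sample_means(n=40, reps=2000)
print("mean of sample means:", round(fmean(means), 3))  # near the population mean, 1
print("sd of sample means:  ", round(stdev(means), 3))  # near 1 / sqrt(40)
```

A histogram of `means` would look roughly bell-shaped even though the population itself is skewed.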

Sampling Method
Matched pairs - It can be used when the experiment has only two treatment conditions and
subjects can be grouped into pairs, based on some blocking variable. Then, within each
pair, subjects are randomly assigned to different treatments.
 
Flip a coin - ​2 outcomes (label each person heads or tails, etc)

Roll a die - 6 outcomes (label each person 1 - 6, etc)

Block Design ​- ​With a randomized block design, the experimenter divides subjects into 
subgroups called blocks, such that the variability within blocks is less than the 
variability between blocks. Then, subjects within each block are randomly assigned to 
treatment conditions.

Simulation - using artificially generated data in order to test out a hypothesis or statistical
method

Double Blind -​ participant AND experimenter don’t know who gets what (placebo or not)
Blind​ - Participant does not know what thing they are getting

Experimental Design Conditions


Control - control group, blinding
Random - randomly assign
Replication - perform the experiment on many subjects to reduce chance variation in
results (benefit is reduced variability, use large sample size)


Unbiased:
Simple Random: the names of 25 employees being chosen out of a hat from a
company of 250 employees
 
Systematic​: assume that in a population of 10,000 people, a statistician selects every 
100th person  
 
Stratified: one might divide a population of adults into subgroups (strata) by age, like 18–29,
30–39, 40–49, 50–59, and 60 and above. Take a simple random sample from each stratum.
 
Cluster: divide the entire population (population of Spain) into different clusters (cities),
based on physical closeness. Then randomly select some clusters and sample everyone in them.
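
The first three sampling methods can be sketched in Python. This is a minimal sketch: the population sizes follow the examples above, but the ID numbers, the two age strata, the random starting point for the systematic sample, and the seed are assumptions made for illustration.

```python
import random

def simple_random_sample(population, k, rng):
    """Every group of k individuals is equally likely to be chosen."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Pick a random start (an assumption here), then take every step-th individual."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, k_per_stratum, rng):
    """Take a simple random sample from each stratum."""
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, k_per_stratum))
    return picked

rng = random.Random(42)
employees = list(range(1, 251))  # 250 hypothetical employee IDs
srs = simple_random_sample(employees, 25, rng)

people = list(range(10000))      # 10,000 hypothetical person IDs
syst = systematic_sample(people, 100, rng)

strata = {"18-29": list(range(0, 100)), "30-39": list(range(100, 200))}
strat = stratified_sample(strata, 5, rng)
print(len(srs), len(syst), len(strat))
```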
 
Biased: 
Convenience: ​Take whoever comes first 
 
Quota: researchers look for a specific characteristic in their respondents, and then take
a tailored sample that is in proportion to a population of interest.
 
Judgement: ​based on expert opinion: a researcher may decide to draw the entire 
sample​ from one "representative" city, even though the population includes all cities. 
 
Snowball:​ people who have many friends are more likely to be recruited into the sample. 
When virtual social networks are used, then this technique is called virtual snowball 
sampling. 
 

Types of Bias:

Non-sampling bias: poor question wording or nonresponse

https://cdn.filestackcontent.com/79D1WXZQQBmyW0ZnPcyT?cache=true&policy=eyJjYWxsIjogWyJyZWFkIiwgInN0YXQiLCAiY29udmVydCJdLCAiZXhwaXJ5IjogNDYyMDM3NzAzMX0%3D&signature=888b9ea3eb997a4d59215bfbe2983c636df3c7da0ff8c6f85811ff74c8982e34

Name That Significance Test Chart


Sampling Distribution for Proportions:

Sampling Distributions for Means:

http://www.mathonmonday.com/stat%20formulae.pdf
portnet.org/cms/lib/NY01001023/Centricity/Domain/195/Stats_Calc_Sheet.pdf
Hypothesis test decisions:
Quantitative variable(s)
Use 1-sample t-test if comparing known population mean to unknown population
mean.
Use 2-sample t-test if comparing two unknown population means if the two
samples are independent.
Use 1-sample t-test for matched pairs if you want to know if there’s a difference
in two means given that the two samples are NOT independent---matched
on some variable.
Use 1-sample t-interval to estimate unknown population mean.
Use 2-sample t-interval to estimate the difference between two unknown
population means.
Categorical variable(s)
Use 1-proportion z-test if comparing a known population proportion to an
unknown population proportion.
Use 2-proportion z-test if comparing two unknown population proportions if the
samples are independent.
Use 1-proportion z-interval to estimate an unknown population proportion.
Use 2-proportion z-interval to estimate the difference between two unknown
population proportions.

Calculator Stuff
Density curve - Has an area of 1
Normal Distribution
Normal density curve - mean=median, bell shaped
z = (statistic - parameter)/sd, normalcdf(lower, upper, mu, sd)

To find the probability that the variable will fall into a certain interval that you supply
normalcdf(lower, upper, mu, sd)

To find the score associated with a percentile
invNorm(area to the left as a decimal, mean, sd) = score at that percentile (a z-score when mean = 0 and sd = 1)
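
For checking calculator answers, the same two calculations can be sketched in Python with the standard library. This is a minimal sketch; the function names below only mimic the calculator commands and are not part of any library.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sd=1.0):
    """Area under the Normal curve between lower and upper."""
    dist = NormalDist(mu, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_left, mu=0.0, sd=1.0):
    """Score that has the given area to its left."""
    return NormalDist(mu, sd).inv_cdf(area_left)

print(round(normalcdf(-1, 1), 4))  # proportion within 1 sd of the mean
print(round(inv_norm(0.975), 2))   # score at the 97.5th percentile
```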

Binomial Distribution
To find probability that there will be X successes in n trials if there is a probability p of
success for each trial
binompdf(#trials, probability of success, how many successes desired/variable of interest)

To find probability of getting up to the amount of successes desired (cumulative probability)
binomcdf(#trials, probability of success, how many successes desired/variable of interest)
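
The two binomial calculations can be sketched in Python from the binomial formula. This is a minimal sketch; the function names only mimic the calculator commands, and the example (10 trials, p = 0.5, 5 successes) is made up.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x): exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): at most x successes in n trials."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

print(binompdf(10, 0.5, 5))
print(binomcdf(10, 0.5, 5))
```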

Geometric Distribution
To find the probability that the first success occurs on the specified trial #
geometpdf(prob of success, specified trial #)

To find the probability that the first success occurs on or before the specified trial #
geometcdf(prob of success, specified trial #)
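
The two geometric calculations can be sketched in Python. This is a minimal sketch; the function names only mimic the calculator commands, and the example (p = 0.2, trial 3) is made up.

```python
def geometpdf(p, k):
    """P(first success on trial k): k - 1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

def geometcdf(p, k):
    """P(first success on or before trial k): 1 - P(first k trials all fail)."""
    return 1 - (1 - p) ** k

print(round(geometpdf(0.2, 3), 3))
print(round(geometcdf(0.2, 3), 3))
```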

T Distribution
Probability Density Function
Tpdf (x,df)
Distribution Probability
Tcdf (lower,upper,df)

Find critical value t* (t star): invT takes the area to the left, so for a 95% confidence
level use (1 + 0.95)/2 = 0.975
invT(area to the left, df)

https://mathbits.com/MathBits/TISection/Statistics2/Quick%20Reference%20Sheet%20AP%20Statistics.pdf

PE and MOE Given an Interval
Given an interval, PE (point estimate) is the midpoint of the interval.
MOE (margin of error) is the distance from the midpoint to either end.
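
Recovering the point estimate and margin of error from an interval can be sketched in Python. This is a minimal sketch; the function name and the interval (0.42, 0.48) are made up for illustration.

```python
def point_estimate_and_moe(low, high):
    """Point estimate is the midpoint; margin of error is half the width."""
    pe = (low + high) / 2
    moe = (high - low) / 2
    return pe, moe

pe, moe = point_estimate_and_moe(0.42, 0.48)  # hypothetical interval
print(round(pe, 2), round(moe, 2))
```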

Sentences
ONE VARIABLE RELATIONSHIPS

Describing distributions
SOCV
-Shape​ (left skew, right skew, approx. symmetric),
-​Outlier​ - either just look at the graph
or use outlier formula: Outlier > Q3 + 1.5(IQR) or Outlier < Q1 - 1.5(IQR)
-Center​ (mean, median),
-Variability​ (IQR, std dev, range)

Comparing Distributions
Use COMPARATIVE LANGUAGE for all of SOCV

Mean
A typical value of ​(variable context)​ ​is about​ ​(mean) (units)​.

Standard Deviation
The ​(variable in context)​ typically varies by​ (std dev) (units)​ from the mean of ​(mean)​.

R-Squared (use R-Squared when asked about strength/reliability of the
model) (Remember to check the sign of the slope if calculating r from a computer printout)

(percent value of r^2)% of the variability in (response variable) can be explained by the linear
association between (explanatory variable) and (response variable).

r (correlation coefficient)
If the true relationship between ​(explanatory variable)​ and ​(response variable) ​is linear, then
the linear relation is a ​(weak/moderate/strong) (positive/negative) ​relation.

Y-intercept
When the​ (explanatory variable)​ is zero ​(units)​, the linear model predicts the​ (response
variable) ​will be​ (y-intercept) (units).

Slope
The model predicts that ​(response variable)​ will ​(increase/decrease)​ by approximately ​(value
of slope)​ when the ​(explanatory variable)​ increases by 1​ (units)​ on average.

Residual
Note: when residual > 0, the response variable is greater than predicted value ​(& vice-versa)

The Actual ​(context) ​was about (​residual value) (units) (greater/less) ​than predicted using the
least squares regression line where x = (​x axis explanatory variable)
OR
When the (explanatory variable) is (x-coordinate), the actual value of (response variable) is
(residual value) (units) (greater/less) than the predicted value.

● Comment on residual plot with random scatter
The residual plot shows a random scatter about the residual = 0 line; it appears that a linear
model is a good fit.

● Comment on residual plot with a curved pattern
Based on the curved pattern on this residual plot it appears that a linear model may not be a
good fit.

● Comment on scatter plot with a curved pattern
The curved pattern in this scatterplot reveals that a linear regression model would not be
appropriate for modeling the relationship between these variables.

Outlier
Based on part b an eruption with a 51 minute wait time should be classified as an outlier
because it is more than 2 standard deviations from the mean.
However the point is not an outlier on the scatterplot because it is less than 1 standard
deviation from the least squares regression line (5<6.63).

Confidence Intervals
We are ​(CL)​% confident that the interval from ​(lower bound​) to ​(upper bound)​ captures the
true​ (parameter)​.

P-value

FOR ONE: If (H0 in context) is true, there is a (p-value) probability of a sample of size (n)
having a (statistic) (greater than/less than/as extreme as) (statistic value) by chance alone.

FOR DIFFERENCE: Assuming there’s no difference in the (proportions/means), there is a
(p-value) probability of samples of these sizes having a difference in (proportions/means)
(greater than/less than/as extreme as) (statistic value) by chance alone.

Which direction to use in the sentences above:
If your alternative hypothesis is p1 - p2 < 0: difference in proportions is less than or equal to
the observed phat1 - phat2.
If your alternative hypothesis is p1 - p2 > 0: difference in proportions is greater than or equal to
the observed phat1 - phat2.
If your alternative hypothesis is p1 - p2 ≠ 0: difference in proportions is at least as far from
zero as the observed phat1 - phat2, in either direction.

Statistic value = the observed difference in p hat or x bar, for example phat1 - phat2.

● P Value - Conclusion
Notes: if p value is less than alpha, reject H0; the data supports Ha.
If p value is greater than alpha, fail to reject H0; the data does not support Ha.
Since the p-value is (greater than/less than) alpha, we (fail to reject/reject) H0. The data
(does not/does) provide convincing evidence for (Ha in context).

Comment on whether there would be a difference in the population mean discoloration
ratings for the treated and untreated strawberries.

a) Because the 95% confidence interval is strictly positive, zero is not a plausible value for
mu1 - mu2. The interval supports a difference, meaning it doesn’t support mu1 - mu2 = 0.

Type I Error – Reject H0 when it’s true

A Type I Error is when we erroneously determine that (alternative hypothesis in context) when
in fact (null hypothesis in context).

Type II Error – Fail to reject H0 when it’s false

A Type II Error is when we erroneously fail to reject (null hypothesis in context) when in fact
(alternative hypothesis in context).

Power
The power in this situation is probability of correctly determining that ​(alternative hypothesis)
when the true value of the ​(parameter in context)​ is ​(alternative value)​.

Z-Score
This (​context)​ is ​(|z score|)​ standard deviations ​(greater/less than)​ the mean​ (​context)​.

Example: This wait time was 2.98 standard deviations less than the mean wait time.
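
The z-score sentence can be sketched in Python. This is a minimal sketch; the function names and the numbers (value 60, mean 75, standard deviation 5) are made up and are not the values from the wait-time example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations the value lies from the mean."""
    return (value - mean) / sd

def describe(value, mean, sd):
    """Fill in the interpretation template using |z| and a direction."""
    z = z_score(value, mean, sd)
    direction = "greater than" if z > 0 else "less than"
    return f"This value is {abs(z):.2f} standard deviations {direction} the mean."

print(z_score(60, 75, 5))
print(describe(60, 75, 5))
```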
