R Cheat Sheet 3


STAT 511

Overview and Descriptive Statistics

Prof. Michael Levine

September 3, 2018

Levine STAT 511


Populations and Samples

I A population is a well-defined collection of objects.
I An example: all gelatin capsules of a particular type produced
during a specified period.
I When information is available for the entire population, we
have a census.
I A subset of the population is a sample.
I Example: a sample of bearings from a particular production
run.



Populations and Samples

I A variable is any characteristic whose value may change from
one object to another in the population.
I Univariate data consists of observations on a single variable.
I An example: the type of transmission (automatic or manual) on
each of ten automobiles recently purchased.
I Multivariate data consists of observations made on more than
one variable.
I An example: a record of systolic and diastolic blood pressure
as well as the serum cholesterol level for each patient.



Example of an Experiment: Coin Spinning

I It is sometimes claimed that spinning a coin produces
outcomes different from tossing the same coin. In particular,
one often hears that the likelihood of obtaining heads (H) is
less than 50% in that case.
I Two questions to investigate:
1. Choose a penny. What is the chance of obtaining H by
spinning that penny? Also, are two pennies equally likely to
produce H when spun?
2. Choose several pennies minted before 1982 and several pennies
minted after 1982. As groups, are pre-1982 pennies and
post-1982 pennies equally likely to produce H when spun?
(Before 1982: 95% copper and 5% zinc; after 1982: 97.5%
zinc and 2.5% copper.)



Branches of Statistics

I Often, the first task after data collection is to summarize
the data.
I This may involve both graphical methods and the
calculation of numerical summary measures.
I Descriptive statistics: summary and description of collected
data.
I After obtaining a sample from the population, it is frequently
necessary to draw some conclusion (make an inference) about
the population as a whole.
I Inferential statistics: generalizing from a sample to a
population.



Randomization: Lanarkshire milk experiment

I Lanarkshire milk experiment: 5,000 children received a daily
supplement of 3/4 pint of raw milk, 5,000 received 3/4 pint of
pasteurized milk, and 10,000 children received no daily
supplement.
I Each child was weighed (while wearing indoor clothing) and
measured for height in February of 1930 (before the start of
the study) and in June of 1930 (after the end of the study).
I The final observations of the control group exceeded those of
the treatment groups by average amounts equivalent to 3
months' growth in weight and 4 months' growth in
height. Why?



Randomization: Lanarkshire milk experiment

I Initially, the division into treatment vs. control was arbitrary,
e.g. using the alphabet.
I If the initial division produced groups with unbalanced
numbers of well-fed or ill-nourished children, teachers were
allowed to swap children between the two groups to produce
(apparently) better-balanced groups. Sounds interesting? It
should!
I Another culprit was the quality of the clothing worn by the children.



The pre-election polls of 1948

I Presidential election of 1948: Truman vs. Dewey (he of the
two Hollywood films fame: "Smashing the Rackets" and
"Racket Busters").
I Predictions (Dewey vs. Truman): Crossley poll 50% vs. 45%;
Gallup poll 50% vs. 44%; Roper poll 53% vs. 38%.
I The actual result: Truman slightly less than 50% vs. Dewey
slightly more than 45%. Why?



Quota sampling

I First, one selects some important characteristics of the
population, e.g. age, sex, race, etc.
I Second, one attempts to obtain a sample that mimics the
general population with respect to these characteristics. For
example, out of 15 people, there should be 7 men and 8
women; out of the 7 men, 3 have to be under 40 years of age,
etc.
I The problem here is that nobody specifies how to choose within
quotas. Commonly, interviewers selected people with higher
educational levels because they were easier to deal with.
I More educated voters tended to be more affluent and voted
Republican in higher numbers.



Randomization in practice: Case I

I There is a need to test two versions of the final exam. Let's
say we have 40 students.
I Write the names of the students on slips of paper and put
them in a jar.
I Sample 20 names without replacement; those will make up
Group 1.
I The rest will make up Group 2.
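
The jar-and-slips procedure can be imitated in software. The following is a minimal sketch, assuming a hypothetical roster of 40 placeholder names and an arbitrary seed chosen only to make the split reproducible; neither appears in the slides.

```python
import random

# Hypothetical roster: 40 placeholder student names (not real data).
students = [f"student_{i:02d}" for i in range(1, 41)]

# Arbitrary fixed seed so that the same split can be reproduced later.
rng = random.Random(511)

# Draw 20 names without replacement: these form Group 1.
group_1 = rng.sample(students, k=20)

# Everyone who was not drawn forms Group 2.
drawn = set(group_1)
group_2 = [name for name in students if name not in drawn]

print(len(group_1), len(group_2))
```

Because `sample` draws without replacement, no student can land in both groups, which mirrors pulling slips out of the jar one at a time.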



Randomization in practice: Case II

I Now imagine that we know in advance that the class
comprises 30 freshmen and 10 non-freshmen. We believe it is
essential to have 3/4 freshmen in each group.
I Create 40 slips of paper again, but now separate the freshman and
non-freshman slips.
I First, draw 15 slips out of the 30 freshman slips; those 15 receive
exam A, the rest receive exam B.
I Second, draw 5 slips out of the 10 non-freshman slips; those 5
receive exam A, the rest receive exam B.
I This is called stratified random sampling.
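
The stratified version can be sketched the same way: randomize separately within each stratum, then pool. The names, the helper `split_stratum`, and the seed below are illustrative assumptions, not part of the slides.

```python
import random

# Hypothetical rosters for the two strata (placeholder names).
freshmen = [f"freshman_{i:02d}" for i in range(1, 31)]
non_freshmen = [f"non_freshman_{i:02d}" for i in range(1, 11)]

rng = random.Random(511)  # arbitrary seed, for reproducibility only


def split_stratum(names, n_exam_a, rng):
    """Randomly pick n_exam_a names for exam A; the rest get exam B."""
    exam_a = rng.sample(names, k=n_exam_a)
    chosen = set(exam_a)
    exam_b = [name for name in names if name not in chosen]
    return exam_a, exam_b


# Randomize within each stratum separately.
fresh_a, fresh_b = split_stratum(freshmen, 15, rng)
other_a, other_b = split_stratum(non_freshmen, 5, rng)

# Pool the strata to form the two exam groups.
exam_a = fresh_a + other_a
exam_b = fresh_b + other_b

# Share of freshmen in each group.
share_a = len(fresh_a) / len(exam_a)
share_b = len(fresh_b) / len(exam_b)
print(share_a, share_b)
```

Whatever the seed, each exam group ends up with 15 freshmen out of 20 students, so the 3/4 proportion holds by construction rather than by chance.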



Histograms of Discrete Data

I Determine the frequency and relative frequency for each value
of x.
I Mark the possible values of x on a horizontal scale.
I Above each value, draw a rectangle whose height is the
relative frequency of that value.



Example

I Students from a small college were asked how many charge
cards they carry. x is the variable representing the number of
cards.

x    # people    Rel. freq.
0    12          0.08
1    42          0.28
2    57          0.38
3    24          0.16
4    9           0.06
5    4           0.03
6    2           0.01
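
The relative frequencies in the table can be reproduced from the counts. This short sketch uses only the counts shown above; the text bar chart is an illustrative stand-in for the histogram.

```python
# Counts from the charge-card table: number of cards -> number of students.
counts = {0: 12, 1: 42, 2: 57, 3: 24, 4: 9, 5: 4, 6: 2}

total = sum(counts.values())

# Relative frequency = frequency / total number of observations.
rel_freq = {x: n / total for x, n in counts.items()}

for x, n in counts.items():
    # One '#' per percentage point gives a rough text histogram.
    bar = "#" * round(rel_freq[x] * 100)
    print(f"{x}  {n:3d}  {rel_freq[x]:.2f}  {bar}")
```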



Example

I Number of hits per game for all nine-inning games that were
played in Major League Baseball between 1989 and 1993.
I The histogram seems to be unimodal but not symmetric.



Histograms for Continuous Data: Equal Class Widths

I The first step is splitting the data into a suitable number of
class intervals.
I As an example, let the variable of interest be the fuel
efficiency of an automobile in mpg; the smallest observation is
27.8 and the largest is 31.4.
I Commonly, an observation on a boundary is placed in the
interval to the right of the boundary: [27.5, 28.0),
[28.0, 28.5), [28.5, 29.0), etc.
I Determine the (relative) frequency for each class. Then,
above each class interval, draw a rectangle whose height is the
(relative) frequency.
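
The boundary convention can be made concrete with a small sketch. Only the minimum 27.8, the maximum 31.4, and the class boundaries come from the slide; the other readings and the helper `class_index` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical fuel-efficiency readings (mpg). Only the smallest (27.8)
# and largest (31.4) values are taken from the slide.
mpg = [27.8, 28.0, 28.3, 28.5, 28.5, 29.1, 29.4, 29.9, 30.0, 30.2, 30.6, 31.4]

START = 27.5  # left edge of the first class
WIDTH = 0.5   # common class width


def class_index(x, start=START, width=WIDTH):
    """Index k of the class [start + k*width, start + (k+1)*width) holding x.

    Taking the floor puts a value that sits exactly on a boundary
    into the class to the right of that boundary.
    """
    return math.floor((x - start) / width)


counts = Counter(class_index(x) for x in mpg)

for k in sorted(counts):
    left = START + k * WIDTH
    rel = counts[k] / len(mpg)
    print(f"[{left:.1f}, {left + WIDTH:.1f}): freq = {counts[k]}, rel. freq. = {rel:.3f}")
```

Here 28.0 falls in [28.0, 28.5), not in [27.5, 28.0), as the convention requires.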



Histograms for continuous data with equal class widths: an
example

I The data is a sample of adjusted consumption values during a
particular period for 90 gas-heated homes in Wisconsin (in
BTUs).
I The adjusted consumption is determined as the ratio:
$$\text{adjusted consumption} = \frac{\text{consumption}}{(\text{weather in degree days}) \times (\text{house area})}$$
I A general rule of thumb is that the number of classes is
approximately $\sqrt{\text{number of observations}}$.
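
Applied to the 90 homes in this example, the square-root rule of thumb suggests roughly 9 or 10 classes. A quick check (the variable names are illustrative):

```python
import math

n_obs = 90  # number of homes in the sample

# Rule of thumb: number of classes is about sqrt(number of observations).
k = math.sqrt(n_obs)

# The rule is only approximate, so either neighbouring integer is reasonable.
suggested = (math.floor(k), math.ceil(k))
print(round(k, 2), suggested)
```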



Histograms for Continuous Data: Unequal Widths

I Sometimes, unequal widths are called for, especially if the
data is not uniformly concentrated over the range.
I After determining frequencies and relative frequencies,
calculate the height of each rectangle using the formula:
$$\text{rectangle height} = \frac{\text{relative frequency of the class}}{\text{class width}}$$
I The resulting heights are called densities. The vertical scale is
called the density scale.
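
A minimal sketch of the density calculation, using made-up classes and frequencies (none of these numbers come from the slides). It also checks that the rectangle areas add up to 1, since each area equals the class's relative frequency.

```python
# Hypothetical classes with unequal widths: (left edge, right edge, frequency).
classes = [(0.0, 2.0, 6), (2.0, 4.0, 14), (4.0, 8.0, 12), (8.0, 16.0, 8)]

n = sum(freq for _, _, freq in classes)

densities = []
total_area = 0.0
for left, right, freq in classes:
    width = right - left
    rel_freq = freq / n
    density = rel_freq / width      # rectangle height on the density scale
    densities.append(density)
    total_area += density * width   # area = height * width = rel. frequency
    print(f"[{left:g}, {right:g}): rel. freq. = {rel_freq:.3f}, density = {density:.4f}")

print(f"total area = {total_area:.3f}")
```

Note that the widest class has the smallest density even though its frequency is not the smallest; plotting raw frequencies instead would exaggerate it.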



Histograms for continuous data with unequal class widths:
an example

I The problem is the corrosion of the reinforcing steel...
I The data consists of 48 observations on measured bond
strength (used for bonding glass-fiber-reinforced plastic rebars
to concrete).
I The total area of all rectangles in a histogram drawn to
density scale is always 1.



Histogram shapes

I Symmetric unimodal
I Bimodal
I Right-skewed
I Left-skewed



Sample mean

I The sample mean of the n numbers $x_1, \ldots, x_n$ is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
I Alternative notation:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$



Example

I Recent years have seen growing commercial interest in the use
of what is known as internally cured concrete.
I This concrete contains porous inclusions, most commonly in
the form of lightweight aggregate (LWA).
I The article "Characterizing Lightweight Aggregate Desorption
at High Relative Humidities Using a Pressure Plate Apparatus"
(J. of Materials in Civil Engr, 2012: 961-969) reported on a
study in which researchers examined various physical
properties of 14 LWA specimens.



Example

I Here are the 24-hour water-absorption percentages for the
specimens:

[Figure: 24-hour water-absorption percentages for the 14 LWA specimens]

I With the sum total $\sum x_i = 229.0$, the sample mean is
$$\bar{x} = \frac{229.0}{14} = 16.36$$



The sample mean is not robust!

I The mean value can be greatly affected by the presence of
even a single outlier (an unusually large or small observation).
I If a sample of employees contains nine who earn 50,000 per
year and one whose yearly salary is 150,000, the sample mean
salary is 60,000; this value certainly does not seem
representative of the data.
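
The salary example is easy to verify numerically. The sketch below computes the mean with and without the single large salary; the variable names are illustrative.

```python
from statistics import mean

# Nine employees earning 50,000 and one earning 150,000.
salaries = [50_000] * 9 + [150_000]

mean_all = mean(salaries)                   # pulled up by the one large salary
mean_without_outlier = mean(salaries[:-1])  # the nine typical salaries only

print(mean_all, mean_without_outlier)
```

A single observation out of ten moves the mean by 10,000, which is why no employee in the sample actually earns an amount close to the mean.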



Sample median

I The sample median $\tilde{x}$ is the middle value in the set of data
that has been arranged in ascending order. For an even
number of data points, the median is the average of the
middle two.
I More precisely, suppose the number of observations n is odd.
Then the median $\tilde{x}$ is observation number $\frac{n+1}{2}$ in the ordered list.
I In the same way, if n is even, the median is defined as the
average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th ordered observations.
I The median is a robust measure of the center of the data, unlike the mean.
I The mean and the median are generally not the same.



Median calculation example

I The following data give the concentration of a specific
receptor for a sample of women with evidence of
iron-deficiency anemia. When ordered, they are
7.6 8.3 9.3 9.4 9.4 9.7 10.4 11.5 11.9 15.2 16.2 20.4
I As n = 12 is even, the median is $\frac{9.7 + 10.4}{2} = 10.05$.
I What would happen if the largest observation, 20.4, were not
there?
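
One way to explore the closing question is to compute the median and the mean with and without 20.4. The function `sample_median` is an illustrative implementation of the odd/even rule stated earlier, not code from the course.

```python
def sample_median(values):
    """Median by the odd/even rule: the middle ordered value, or the
    average of the two middle ordered values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # observation number (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2


receptor = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]
without_largest = receptor[:-1]  # drop the largest observation, 20.4

median_all = sample_median(receptor)
median_without = sample_median(without_largest)
mean_all = sum(receptor) / len(receptor)
mean_without = sum(without_largest) / len(without_largest)

print(f"median: {median_all:.2f} -> {median_without:.2f}")
print(f"mean:   {mean_all:.2f} -> {mean_without:.2f}")
```

Dropping 20.4 moves the median from 10.05 to 9.7, while the mean falls from about 11.61 to about 10.81, a noticeably larger shift.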



Population mean and median

I The population mean $\mu$ is defined as the sum of the N
population values divided by N.
I The sample mean is commonly used as a point estimate of
the population mean.
I The population median $\tilde{\mu}$ is defined as the "middle value"
(in the same way as before) for the entire population. Again,
the sample median is commonly used as a point estimate of the
population median.



Trimmed mean

I Often the median and the mean are just two extremes. How
do we claim the middle ground? Consider the trimmed
mean.
I The mean does not discard any observations, while the median
discards almost everything. As an alternative, we can discard a
predetermined number or percentage of observations.
I The following dataset gives the copper percentages in a
sample of 26 Bidri artifacts:
2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
I The regular mean is $\bar{x} = 3.65$ and the median is $\tilde{x} = 3.35$.
The difference is due to the large observation 10.1%.



Trimmed mean

I The 7.7% trimmed mean is the result of removing the 2 smallest and 2
largest observations (2/26 ≈ 7.7% from each end): $\bar{x}_{tr(7.7)} = 3.42$.
I The 10% trimmed mean is an appropriately weighted average of the
7.7% trimmed mean (trimming two values at each end) and the
11.5% trimmed mean (trimming three values at each end).
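
The trimmed means can be checked with a short sketch. Two assumptions are labeled here and in the code: the sample is taken to be the 26 values that include the large observation 10.1 and a second 3.3, which is the set that reproduces the quoted mean 3.65, median 3.35, and trimmed mean 3.42; and the "weighted average" for 10% is read as linear interpolation between the two neighbouring whole-number trims. The function names are illustrative.

```python
# Assumed 26 copper percentages (includes 10.1 and a second 3.3; this set
# reproduces the mean, median, and 7.7% trimmed mean quoted in the slides).
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def trimmed_mean_k(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def trimmed_mean_pct(values, pct):
    """Trimmed mean for a percentage that may not correspond to a whole
    number of observations. Assumption: interpolate linearly between the
    trims just below and just above the requested percentage."""
    exact = pct / 100 * len(values)  # observations to trim per end
    low = int(exact)
    frac = exact - low
    if frac == 0:
        return trimmed_mean_k(values, low)
    return ((1 - frac) * trimmed_mean_k(values, low)
            + frac * trimmed_mean_k(values, low + 1))


print(f"mean              = {trimmed_mean_k(copper, 0):.2f}")
print(f"7.7% trimmed mean = {trimmed_mean_k(copper, 2):.2f}")
print(f"10% trimmed mean  = {trimmed_mean_pct(copper, 10):.3f}")
```

For 10%, the exact trim is 2.6 observations per end, so the result weights the two-value trim by 0.4 and the three-value trim by 0.6.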



Measures of spread

I Variance measures the spread of the data.
I The sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$
I The sample standard deviation is $s = \sqrt{s^2}$, and $n - 1$ is
referred to as the number of degrees of freedom.



Example

I Consider the prefabricated plate example; there are n = 11
plate elements that have been subjected to a severe stress
test. If x is the length of the resulting cracks, $\sum_{i=1}^{n} x_i = 18.349$
and $\bar{x} = \frac{18.349}{11} = 1.6681$. Thus,
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{11 - 1} = \frac{11.9359}{10} = 1.19359.$$



Computing the sample variance

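I An alternative computing formula for $s^2$ is based on the
fact that
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$
I Thus, we can write
$$s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$$
I This formula should be used with as much decimal accuracy as
possible, because it subtracts two quantities that may be large
and nearly equal.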
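
The two formulas can be compared numerically. The sketch below reuses the receptor concentrations from the median example purely as convenient data, and the offset of 1e9 is an artificial choice meant to make the loss of accuracy in the shortcut formula visible; the function names are illustrative.

```python
from statistics import variance


def s2_definition(values):
    """Sample variance from the defining formula: sum of squared deviations."""
    n = len(values)
    xbar = sum(values) / n
    return sum((v - xbar) ** 2 for v in values) / (n - 1)


def s2_shortcut(values):
    """Sample variance from the computing formula based on S_xx."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


x = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]

# On well-scaled data the two formulas agree (and match the library value).
print(s2_definition(x), s2_shortcut(x), variance(x))

# Adding a constant leaves the true variance unchanged, but the shortcut now
# subtracts two huge, nearly equal numbers and loses the answer to rounding.
shifted = [v + 1e9 for v in x]
print(s2_definition(shifted), s2_shortcut(shifted))
```

The same effect occurs in hand calculation when intermediate sums are rounded too early, which is the point of the accuracy warning.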



Some properties of standard deviation

I Let $x_1, \ldots, x_n$ be a sample of n observations and c any
non-zero constant. Denote by $s_x^2$ the sample variance of the x's.
1. If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
2. If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.
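
Both properties can be checked on any sample. The data and the constant below are hypothetical; a negative c is used on purpose to show why the absolute value appears in the standard deviation rule.

```python
import math
from statistics import stdev, variance

x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical sample
c = -3.0                            # any non-zero constant

shifted = [v + c for v in x]  # y_i = x_i + c
scaled = [c * v for v in x]   # y_i = c * x_i

# Property 1: adding a constant does not change the variance.
shift_ok = math.isclose(variance(shifted), variance(x))

# Property 2: scaling multiplies the variance by c^2 and the
# standard deviation by |c|.
scale_var_ok = math.isclose(variance(scaled), c ** 2 * variance(x))
scale_sd_ok = math.isclose(stdev(scaled), abs(c) * stdev(x))

print(shift_ok, scale_var_ok, scale_sd_ok)
```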



The fourth spread

I To define an alternative measure of spread, again order the n
observations in a data set from smallest to largest. Then the
lower (upper) fourth is the median of the smallest (largest) half
of the data, where the median is included in both halves if n is odd.
I Then the fourth spread is defined as
$$f_s = \text{upper fourth} - \text{lower fourth}$$
I Any observation farther than $1.5 f_s$ from the closest fourth is
an outlier. An outlier is extreme if it is more than $3 f_s$ from
the nearest fourth, and it is mild otherwise.
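
A minimal sketch of the fourths, the fourth spread, and the outlier rule. Applying it to the receptor concentrations from the median example is a choice made here for illustration; the function names are hypothetical.

```python
from statistics import median


def fourths(values):
    """Lower and upper fourths: medians of the smallest and largest halves.

    For odd n, the middle observation is included in both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2
    return median(ordered[:half]), median(ordered[n - half:])


def find_outliers(values):
    """Return (lower fourth, upper fourth, f_s, [(value, 'mild'|'extreme')])."""
    lower, upper = fourths(values)
    fs = upper - lower
    flagged = []
    for v in values:
        # Distance beyond the nearest fourth (0 if v lies between the fourths).
        distance = max(lower - v, v - upper, 0.0)
        if distance > 3 * fs:
            flagged.append((v, "extreme"))
        elif distance > 1.5 * fs:
            flagged.append((v, "mild"))
    return lower, upper, fs, flagged


receptor = [7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4]
lower, upper, fs, flagged = find_outliers(receptor)
print(f"lower fourth = {lower:.2f}, upper fourth = {upper:.2f}, fs = {fs:.2f}")
print(flagged)
```

For these data the fourths are 9.35 and 13.55, so $f_s = 4.2$, and 20.4 lies 6.85 above the upper fourth: more than $1.5 f_s = 6.3$ but less than $3 f_s = 12.6$, which makes it a mild outlier.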



Boxplots

I The simplest boxplot is based on a five-number summary: 1) minimum,
2) lower fourth, 3) median, 4) upper fourth, 5) maximum.
I The "whiskers" mark the locations of the smallest and the
largest observations.
I Example: corrosion data on the thickness of the floor plate of a crude
oil storage tank; each observation is the largest pit depth in
milli-in.



Boxplot example
