BMTRY 701 Biostatistical Methods II
Knowledge of Methods I!
You should be very familiar with:
• confidence intervals
• hypothesis testing
  - t-tests
  - Z-tests
• graphical displays of data
• exploratory data analysis
  - estimating means, medians, and quantiles of data
  - estimating variances and standard deviations
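All of the prerequisites above are one-liners in standard software. A minimal refresher sketch in Python (assuming NumPy and SciPy are available; the data values are made up purely for illustration):

```python
import numpy as np
from scipy import stats

# hypothetical sample data, for illustration only
x = np.array([5.1, 4.8, 6.2, 5.5, 5.9, 4.7, 6.0, 5.3])

# estimating means, medians, quantiles
mean, median = x.mean(), np.median(x)
q25, q75 = np.quantile(x, [0.25, 0.75])

# estimating variances and standard deviations (n - 1 denominator)
s2, s = x.var(ddof=1), x.std(ddof=1)

# one-sample t-test of H0: mu = 5, plus a 95% confidence interval for the mean
t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)
ci_low, ci_high = stats.t.interval(0.95, df=len(x) - 1,
                                   loc=mean, scale=stats.sem(x))
```

Note `ddof=1`: NumPy defaults to the n-denominator variance, so the sample (n - 1) versions must be requested explicitly.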
About the instructor
B.A. from Bowdoin College, 1994
• Double Major in Mathematics and Economics
• Minor in Classics
Ph.D. in Biostatistics from Johns Hopkins, 2000
• Dissertation research in latent class models; adviser: Scott Zeger
Assistant Professor in Oncology and Biostatistics at JHU, 2000-2007
Taught course in Statistics for Psychosocial Research for 8 years
Applied Research Areas:
• oncology
Biostats Research Areas:
• latent variable modeling
• class discovery in microarray data
• methodology for early phase oncology clinical trials
Came to MUSC in Feb 2007
Computing
Graphical Displays: Arsenic in Nails
[Figure: displays of arsenic level in nails (ppm) for males and females]
And… the scatterplot
[Scatterplot: level of arsenic in nails (ppm), vertical axis 0.0-2.0]
Y is on the vertical
X predicts Y
Terminology:
• “Regress Y on X”
• Y: dependent variable, response, outcome
• X: independent variable, covariate, regressor, predictor, confounder
Linear regression fits a straight line
• important!
• this is key to linear regression
Simple vs. Multiple linear regression
Why ‘simple’?
• only one “x”
• we’ll talk about multiple linear regression later…
Multiple regression
• more than one “X”
• more to think about: selection of covariates
Not linear?
• need to think about transformations
• sometimes linear will do reasonably well
Association versus Causation
Be careful!
Association ≠ Causation
A statistical relationship does not mean X causes Y
Could be
• X causes Y
• Y causes X
• something else causes both X and Y
• X and Y are spuriously associated in your sample of data
Example: vision and number of gray hairs
Basic Regression Model
Linear in the predictor:
• Yi = β0 + β1Xi + εi
• Yi = β0 + Xiβ1 + εi
• Yi = β0 + β1 + β2Xi + εi
NOT linear in the predictor:
• log(Yi) = β0 + β1Xi + εi
• Yi2 = β0 + β1Xi + εi
Model Features
Yi is the sum of a constant piece and a random piece:
• β0 + β1Xi is the constant piece (recall: X is treated as constant)
• εi is the random piece
Attributes of error term
• mean of residuals is 0: E(εi) = 0
• constant variance of residuals: σ2(εi) = σ2 for all i
• residuals are uncorrelated: cov(εi, εj) = 0 for all i, j; i ≠ j
Consequences
• Expected value of response
E(Yi) = β0 + β1Xi
E(Y) = β0 + β1X
• Variance of Yi given Xi: σ2(Yi|Xi) = σ2
• Yi and Yj are uncorrelated
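These consequences are easy to check by simulation. A sketch in Python, holding Xi fixed at one value and drawing many realizations of the error term (β0, β1, and σ are assumed values, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(42)

# assumed true values, for illustration only
beta0, beta1, sigma = 10.0, 2.0, 1.5
xi = 3.0                                   # hold X fixed at a single value

# many replicate draws of Yi = beta0 + beta1*xi + eps, with E(eps) = 0, Var(eps) = sigma^2
eps = rng.normal(0.0, sigma, size=200_000)
yi = beta0 + beta1 * xi + eps

print(yi.mean())   # ≈ beta0 + beta1*xi = 16
print(yi.var())    # ≈ sigma^2 = 2.25
```

The empirical mean and variance of the replicates match E(Yi) = β0 + β1Xi and Var(Yi|Xi) = σ2.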
Probability Distribution of Y
Scatterplot
[Scatterplot: length of stay (5-15) vs. number of beds (0-800)]
Results
      Source |       SS       df       MS
-------------+------------------------------
       Model |  68.541936     1   68.541936          R-squared     =  0.1675
    Residual |  340.668443   111  3.06908508         Adj R-squared =  0.1600
-------------+------------------------------         Root MSE      =  1.7519
       Total |  409.210379   112   3.6536641
------------------------------------------------------------------------------
         los |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
beds | .0040566 .0008584 4.73 0.000 .0023556 .0057576
_cons | 8.625364 .2720589 31.70 0.000 8.086262 9.164467
------------------------------------------------------------------------------
Another Example: Famous data
[Scatterplot: son's height vs. father's height, both axes 58-78 inches]
Call:
lm(formula = son ~ father)
Residuals:
     Min       1Q   Median       3Q      Max
-7.72874 -1.39750 -0.04029  1.51871  7.66058
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.47177 3.96188 9.963 < 2e-16 ***
father 0.43099 0.05848 7.369 4.55e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[Scatterplot: son's height vs. father's height (58-78 inches), showing the fitted regression line and the line x = y; sample means x̄ = 67.7, ȳ = 68.6]
εi = Yi − (β0 + β1Xi)
Derivation
Two initial steps: reduce the following

  Σ(i=1 to N) (Xi − X̄)2        Σ(i=1 to N) (Xi − X̄)(Yi − Ȳ)
Least Squares

  Q = Σ(i=1 to N) (Yi − β0 − β1Xi)2
Least Squares
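The least-squares estimates minimizing Q have a closed form built from the two sums above: β̂1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)2 and β̂0 = Ȳ − β̂1X̄. A sketch with simulated data (true values assumed, for illustration only), checked against a library fit:

```python
import numpy as np

rng = np.random.default_rng(1)

# simulated data, for illustration only
x = rng.uniform(0, 10, size=100)
y = 3.0 + 0.8 * x + rng.normal(0, 1.0, size=100)

# the two building blocks from the derivation
sxx = np.sum((x - x.mean()) ** 2)               # sum of (Xi - Xbar)^2
sxy = np.sum((x - x.mean()) * (y - y.mean()))   # sum of (Xi - Xbar)(Yi - Ybar)

# closed-form least-squares estimates minimizing Q
b1_hat = sxy / sxx
b0_hat = y.mean() - b1_hat * x.mean()

# agrees with a generic least-squares fit
slope, intercept = np.polyfit(x, y, deg=1)
assert np.isclose(b1_hat, slope) and np.isclose(b0_hat, intercept)
```

The closed-form estimates coincide (to numerical precision) with what `np.polyfit` returns, since both minimize the same sum of squared deviations.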