CS395T - Computational Statistics With Application To Bioinformatics

Prof. William H. Press
Course Lecture Notes (Spring, 2008)
Unit 1: Probability Theory and Bayesian Inference

Concepts: probability theorems and examples; inference, Bayesian inference; marginalization;
nuisance parameter; posterior; Bernoulli trials; conjugate prior

MATLAB: syms, int, ezplot, diff, simplify, solve, pretty
Mathematica: Integrate, GenerateConditions, D, Plot, Simplify, Solve

Unit 2: Univariate Distributions and the Central Limit Theorem

Concepts: measures of central tendency, mean, median; normal (Gaussian), Student, Cauchy,
lognormal, exponential, gamma, chi-square; PDF, CDF, characteristic function; Central Limit
Theorem

NR3 (C++): Normaldist

Unit 3: Random Number Generators, Tests for Randomness, and Tail Tests Generally

Concepts: random number generator (RNG); multiplicative RNG, p-values, t-values; binomial
distribution; chi-square test; 1- vs. 2-point distribution; Xorshift RNG; combinations of
generators; p-value paradigm

MATLAB: uint32, mod, accumarray, betainc, normcdf, ceil, zeros, chi2cdf
MATLAB API (C): mex functions, mxGetData, mxGetM, mxGetN
NR3 (C++): nr3.h, struct Toyran1, Chisqdist, Ran

Unit 4: Tail Test Perils and Pitfalls: Chi-Square Misuse, Multiple Hypotheses, Stopping
Criteria

Concepts: moments of chi-square variable, how chi-square becomes normal; chi-square
failure for Poisson events; linear constraints; multiple hypothesis correction, Bonferroni, FDR;
stopping rule paradoxes

MATLAB: symsum, betapdf, quad, linspace
Mathematica: Sum

http://nr.com/CS395T/lectures2008/toc.html[31/10/2014 10:30:12 πμ]

Unit 5: More on Random Deviate Generation

Concepts: Xorshift generators; matrix powers by successive squaring; GCD and Gorilla
randomness tests; transformation method; rejection method; ratio of uniforms method;
squeezes; Leva's algorithm

MATLAB: ndgrid, eye, spy, jacobian, abs, det
Mathematica: FactorInteger

Unit 6: Understanding Distributions Known Only Empirically

Concepts: empirical distributions, samples; Kolmogorov-Smirnov (KS) test; IQagent data
structure; genestats.dat data file, intron and exon lengths; plotting PDFs, uniformity of errors,
PDFs on log scales; resampling; statistical significance vs. data quantity

MATLAB: readgenestats (custom), fopen, fclose, repmat, cell, textscan,
dataset, error, cell2mat, plot, hold, log10, cdfplot, kstest2,
arrayfun, loglog, semilogy
MATLAB API (C): mxCreateDoubleMatrix
NR3 (C++): IQagent

Unit 7: Fitting Models to Data and Estimating Errors in Model-Derived Quantities

Concepts: binned data; nonlinear least squares (NLS) fits; covariance matrix; goodness of fit;
linear propagation of errors; Jacobian matrix; sampling the posterior distribution; bootstrap
resampling

MATLAB: hist, bar, nlinfit, nlinfitw (custom), diag, randn, numel,
chi2cdf, jacobian, subs, mvnrnd, mean, std, randsample, arrayfun

Unit 8: Contingency Tables, Experimental Protocols, and All That

Concepts: contingency tables; null hypothesis; Pearson statistic; retrospective or case-control;
prospective or longitudinal; cross-sectional or snapshot; nuisance parameters,
marginalization; hypergeometric distribution; multinomial distribution; Fisher Exact Test;
Wald statistic; nominal, ordinal, cardinal tables; permutation test; bootstrap resampling;
Dirichlet distribution

MATLAB: crosstab, contingencytable (custom), sum, repmat, size,
squeeze, permute, ndgrid, repmat, accumarray, arrayfun, randperm,
hist, randsample, gamrnd, mnrnd, reshape

Unit 9: Working with Multivariate Normal Distributions

Concepts: multivariate normal distribution; covariance matrix; spliceosome; linear correlation
matrix; Cholesky decomposition; error ellipses

MATLAB: mean, cov, randsample, mvnrnd, corrcoef, chol, errorellipse (custom)

Unit 10: Hierarchical Clustering by Phylogenetic Trees

Concepts: phylogenetic trees; cladograms, additive trees, ultrametric trees; distance matrix,
neighbor joining; agglomerative method; vertebrate species; gene chip; Hamming distance;
rooted vs. unrooted; gene co-expression; Pearson r; TreeView

NR3 (C++): Phylo_nj, newick

Unit 11: Gaussian Mixture Models and EM Methods

Concepts: Gaussian mixture model (GMM); E-step, M-step, EM method; log-sum-exp;
k-means clustering; Jensen's inequality; missing data problems

MATLAB: sum, repmat, arrayfun, ksdensity, mvnrnd
NR3 MATLAB interface: nr3_matlab.h, mxScalar, mxT, MatDoub, VecDoub

Unit 12: Maximum Likelihood Estimation (MLE) on a Statistical Model

Concepts: likelihood function; Fisher Information Matrix, Hessian; centered second
difference; outliers; Student-t; AIC, BIC

MATLAB: hist, bar, fminsearch, hessian (custom), inv, jacobian, subs,
arrayfun

Unit 13: Markov Chain Monte Carlo (MCMC)

Concepts: unnormalized distribution, posterior; Markov chain; detailed balance, ergodicity;
Metropolis-Hastings algorithm, proposal distribution, acceptance probability; Poisson
process, fluctuations

MATLAB: rand, subfunction

Unit 14: SVD, PCA, and the Linear Perspective

Concepts: data matrix, design matrix; standardize; Singular Value Decomposition (SVD);
orthogonal basis; low-rank approximation; Principal Component Analysis (PCA); main
effects; Gaussian random matrix; order statistic; dimensional reduction; eigengenes,
eigenarrays; non-negative matrix factorization (NMF)

MATLAB: prctile, repmat, colormap, image, svd, axis, semilogy,
randn, cumsum

Unit 15: Dynamic Programming, Viterbi, and Needleman-Wunsch

Concepts: Bellman-Dijkstra-Viterbi algorithm, forward pass, backward pass; error-correcting
code; trellis graph; soft decision decoding; sequence alignment; Needleman-Wunsch
algorithm; multiple alignment

NR3 (C++): stringalign

Unit 16: Hidden Markov Models

Concepts: Markov model; transition probability; irreducibility, aperiodicity, ergodicity;
successive squaring method; Hidden Markov Model (HMM); symbol probability; state
estimation; forward-backward algorithm, alpha pass, beta pass; Baum-Welch re-estimation;
likelihood; EM method; Generalized HMM, Hidden Semi-Markov Model

NR3 (C++): HMM
NR3 MATLAB interface: hmmmex

Unit 17: Classifier Performance: ROC, Precision-Recall, and All That

Concepts: confusion matrix, TP, FP, TN, FN; conservative, liberal; performance curve; TPR,
FPR, PPV, NPV, FDR; accuracy, sensitivity, specificity, precision, recall; ROC curve; convex
hull; precision-recall curve

Mathematica: Solve, FullSimplify, substitution operator (/.)

Unit 18: Support Vector Machines (SVMs)

Concepts: linear separation; fat plane; maximum margin SVM; quadratic programming;
primal vs. dual problem; soft-margin SVM; embedding; the kernel trick; linear, power,
polynomial, sigmoid, Gaussian radial basis kernels; mitochondrial genes

Software: SVMlight

Unit 19: Wiener Filtering (and some Wavelets)

Concepts: signal, noise, filter; Wiener filter; best estimate in L2 norm; Fourier basis; Nyquist
frequency; low-pass filter; signal and noise models; spatial (pixel) basis; smoothed image;
wavelet basis; quadrature mirror filter; orthogonality conditions; moment conditions;
pyramidal algorithm; DAUB; left- and right-derivative

MATLAB: fopen, fread, fclose, flipud, image, axis, fft2, ndgrid,
randsample, ifft2, wiener2, wavelet2 (custom)

Unit 20: Multidimensional Interpolation on Scattered Data

Concepts: dimensional explosion; Shepard interpolation; Radial Basis Function
interpolation; multiquadric, inverse multiquadric, thin plate spline, Gaussian; over- and
under-smoothing; Laplace interpolation; boundary conditions; biconjugate gradient method;
Gaussian process regression; linear prediction; Kriging; variogram

MATLAB: interp1, meshgrid, arrayfun, contour, cell, cellfun,
shepinterp (custom), \-operator, std, laplaceinterp (custom), krig (custom)

Unit 21: Information Theory Characterization of Distributions

Concepts: character, alphabet, message; entropy; compression; log cut-down; fair game;
payoff odds; protein, amino acid; monographic, digraphic entropy; flattened; conditional
entropy; mutual information; Lagrange multiplier; Kelly's formula, proportional betting; CG
richness, 3rd codon; Kullback-Leibler distance; log odds
