A Short Course in Multivariate Statistical Methods With R
Outline
• R environment, setup, basics
• Multivariate Analysis - what is it?
• Exploration and Visualization
• Principal Components
• Multidimensional Scaling
• Exploratory Factor Analysis
• Confirmatory Factor Analysis
– Structural Equation Modeling
• Cluster Analysis
• Repeated Measures
• Additional topics, wrapup
Goals
• Exposure to R
• Familiarity with major concepts used in multivariate analysis
• Implement these tools in R
• Learn “how to learn” - investigate and solve your own data problems
• Mastery is not possible in a short course. Don’t worry!
The R Environment
• R
– free - easy to use and expand
– open - fast and innovative
– first - cutting edge of Data Science
• R Markdown
– supports integration of code and text
– multiple outputs (doc, html, pdf)
• RStudio
– consistent, coding-friendly development environment
– tools for publishing (Rpres, Rpubs)
– literate programming
– cross-platform and server version
Setup
• download R
– R-project.org
• download RStudio
– Rstudio.com
• code and presentation available from
– github.com/ryandata
• RStudio.cloud is an experimental cloud server for R, currently free
• other texts distributed locally
Probability Plots
• We may need to test whether data fit the multivariate normal distribution
• If the data are (multivariate) normal, the squared (Mahalanobis) distances of the observations from the mean will follow a chi-squared distribution
• The plot (computation illustrated in the sketch below) should show data points falling roughly on a straight line
• Points far off the line can identify outliers
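A minimal sketch of such a chi-squared probability plot, using the built-in iris measurements as stand-in data (any numeric data matrix will do):

x  <- as.matrix(iris[, 1:4])                # any numeric data matrix
d2 <- mahalanobis(x, colMeans(x), cov(x))   # squared generalized distances
qqplot(qchisq(ppoints(nrow(x)), df = ncol(x)), d2,
       xlab = "Chi-squared quantiles", ylab = "Ordered squared distances")
abline(0, 1)                                # points near this line suggest MV normality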
Data Visualization
• Advantages of visualization:
– Easily detect patterns in data
– Generate greater interest, understanding, and recall (for non-specialists and specialists alike)
– Compress the meaning of large amounts of data into a smaller set of images
– Discover hidden structure of data
Three-dimensional data
• Many tools can be used to visualize data in three dimensions
• Just a few examples in the code, more are illustrated at my Data Visualization
workshop
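One quick possibility among many, a sketch using the lattice package that ships with R:

library(lattice)
# 3-D point cloud of three iris measurements, colored by species
cloud(Sepal.Length ~ Sepal.Width * Petal.Length, data = iris,
      groups = Species, pch = 16)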
PCA, continued
• The principal components in principal components analysis are vectors
• Each vector is a linear combination of variables
$z = a x_1 + b x_2 + c x_3 + \dots$
• We want to find the smallest number of vectors that account for most of the variation
in the data
• We do not know beforehand which variables are most useful for this task
• PCA solves this problem
• A low dimension summary of the data for graphing or other representations
• In m dimensions, we find the m-dimensional projection that best fits the data
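A minimal prcomp sketch on the built-in iris measurements (stand-in data):

# PCA on four measurements; scale. = TRUE works from the correlation matrix
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # scores on the first two components: a low-dimensional summary
biplot(pca)         # joint plot of observations and variable loadings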
Scale
• PCA is not scale-invariant, i.e., it produces different results for different units of measurement
• Extracting components from the covariance matrix therefore inherits this scale problem
• In practice, we extract components from the correlation matrix instead (its entries are standardized to lie between -1 and 1)
• This also means we are implicitly giving all variables equal weight, each with equal potential of being part of the solution (not always appropriate)
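A quick illustration of the scale dependence, again on the iris measurements:

x <- iris[, 1:4]
prcomp(x, scale. = FALSE)$sdev  # covariance-based solution: depends on units
prcomp(x, scale. = TRUE)$sdev   # correlation-based solution: unit-free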
Multidimensional Scaling
• An extension of PCA’s methodology
• Extract a low-dimensional representation of the data, preserving relative distances
• Works on the distance matrix
• Some measurement of how similar or dissimilar items are
• Here, two spatial methods:
– Classical Multidimensional Scaling
– Non-metric Multidimensional Scaling
Solving MDS
• Start with the (Euclidean) distance matrix (sometimes all we have)
• Reconstruct an estimate of the original data (up to rotation and translation)
• Because this method also keeps the eigenvectors that account for most of the variation, it is equivalent to principal components when the distances are Euclidean, and is often called principal coordinates
• Choose the number of dimensions m where the proportion

$P_m = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{q} \lambda_i}$

is "high", i.e., the first m eigenvalues $\lambda_i$ account for most of the variation
• Minimum spanning tree (“mst” command from “ape” package) can identify close
groupings of observations
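A sketch of classical MDS with a minimum spanning tree overlay, using the built-in eurodist road distances and assuming the ape package is installed:

mds <- cmdscale(eurodist, k = 2)          # classical MDS / principal coordinates
plot(mds, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds, labels = rownames(mds), cex = 0.7)

library(ape)
tree <- mst(eurodist)                     # adjacency matrix of the spanning tree
for (i in 1:nrow(mds))                    # draw each tree edge on the MDS plot
  for (j in which(tree[i, ] == 1))
    lines(mds[c(i, j), 1], mds[c(i, j), 2], lty = 2)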
Non-metric MDS
• Typically with ordered or ranked data, we can use a non-metric technique
• isoMDS command from “MASS” package
• use Shepard diagram to diagnose fit
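A minimal non-metric MDS sketch with MASS, using the built-in swiss data as a stand-in:

library(MASS)
d   <- dist(scale(swiss))          # distances on standardized data
fit <- isoMDS(d, k = 2)            # non-metric MDS; prints the final stress
plot(fit$points, type = "n")
text(fit$points, labels = rownames(swiss), cex = 0.6)

# Shepard diagram: observed dissimilarities vs. fitted distances
sh <- Shepard(d, fit$points)
plot(sh$x, sh$y, pch = ".", xlab = "Dissimilarity", ylab = "Distance")
lines(sh$x, sh$yf, type = "S")     # the monotone (isotonic) regression fit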
Correspondence Analysis
• Essentially a method for plotting associations between categorical variables
• Row variables that appear close in a plot are similar
• Column variables that appear close are similar
• Row/column pairs indicate association
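A brief sketch with MASS::corresp on the built-in caith table of hair and eye color counts:

library(MASS)
caith                           # contingency table: eye color (rows) x hair color (columns)
biplot(corresp(caith, nf = 2))  # rows and columns displayed in the same space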
Assessing Fit
• A $\chi^2$ statistic is typically used to compare the fitted covariance matrix against the unconstrained (observed) one
• Other measures: the Goodness of Fit index, the Adjusted Goodness of Fit index, and more
• Normed residuals < 2 in absolute value are another check
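One possible sketch using the lavaan package (an assumption; other SEM packages offer equivalent measures), fitting the classic three-factor model to lavaan's built-in HolzingerSwineford1939 data:

library(lavaan)
model <- 'visual  =~ x1 + x2 + x3
          textual =~ x4 + x5 + x6
          speed   =~ x7 + x8 + x9'
fit <- cfa(model, data = HolzingerSwineford1939)
fitMeasures(fit, c("chisq", "df", "pvalue", "gfi", "agfi"))
resid(fit, type = "standardized")   # inspect the standardized (normed) residuals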
Cluster Analysis
• Classification is a fundamental tool for understanding data, with application across
physical, life, and social sciences
• Cluster analysis provides numerical methods for sorting data into meaningful groups.
• Many methods are possible, 3 are described here:
– agglomerative hierarchical techniques
– k-means clustering
– model-based clustering
Agglomerative hierarchical techniques
• Hierarchy is generated by steps
• Start with each individual observation
• Merge the closest two observations into a cluster
• Repeat…
• Relies on distance metric (often Euclidean)
• Methods vary (using max distance or min distance between clusters, or a central
measure)
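A minimal hierarchical clustering sketch on the iris measurements:

d  <- dist(scale(iris[, 1:4]))        # Euclidean distances on standardized data
hc <- hclust(d, method = "complete")  # "complete" = max distance; "single" = min; "average" = central
plot(hc, labels = FALSE)              # dendrogram of the merge hierarchy
cutree(hc, k = 3)                     # cut the tree into three groups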
k-means clustering
• In general, minimize some metric of aggregate distance
• In practice, for a given number of clusters k, minimize the within-group sum of squares

$WGSS = \sum_{l=1}^{k} \sum_{i \in G_l} \sum_{j=1}^{q} \left( x_{ij} - \bar{x}_j^{(l)} \right)^2$

where $G_l$ is the set of observations in cluster l and $\bar{x}_j^{(l)}$ is the mean of variable j within it
• An iterative process finds a local, but not necessarily global, minimum by moving one element at a time among clusters to see if the move reduces the group sum of squares
k-means continued
• k-means imposes spherical clusters due to its method, even if a better-fitting, odd
shaped cluster is available
• k-means is not scale invariant
• One way of finding the optimal number of clusters k is to plot WGSS against k and look for an "elbow", or prominent angle, in the plot (sketch below). This indicates that WGSS no longer decreases significantly as additional clusters are added.
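A minimal elbow-plot sketch, again on the iris measurements:

x    <- scale(iris[, 1:4])
wgss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wgss, type = "b",
     xlab = "Number of clusters k", ylab = "Within-group sum of squares")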
Model-based clustering
• A model for the population from which data is sampled provides more tools for
selecting clusters via statistical criteria
• Subpopulations present in different proportions, each with its own distribution, combine into a finite mixture density for the population as a whole
• Probabilities associated with subpopulations are estimated via Maximum Likelihood
Estimation (usually via iterative method)
• mclust package in R implements this
• We can plot clusters in various ways, such as the “stripes” plot in package flexclust
• Stripes plot reveals overlap and separation between clusters across multiple
dimensions
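A minimal mclust sketch; the package selects among Gaussian mixture forms by BIC:

library(mclust)
fit <- Mclust(iris[, 1:4])   # fits a range of mixture models, keeps the best BIC
summary(fit)                 # chosen model, number of clusters, mixing proportions
plot(fit, what = "classification")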
Repeated measures
• Repeated measures describe some of the most common situations in data analysis
• Collecting multiple samples/observations on a single subject
• Collecting data over time
• Data can be recorded in "wide" or "long" format (convert with the reshape command; sketch below)
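A minimal wide-to-long sketch with base R's reshape, on hypothetical toy data:

# Hypothetical wide data: one row per subject, one column per visit
wide <- data.frame(id = 1:3,
                   score.1 = c(10, 12, 9),
                   score.2 = c(11, 14, 10),
                   score.3 = c(13, 15, 12))
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("score.1", "score.2", "score.3"),
                v.names = "score", timevar = "visit")
long   # one row per subject-visit combination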
Mixed-effects models
• We know that the repeated observations are related to each other
• So treating each observation as independent (e.g., standard linear regression) is not
appropriate
• The model must separate the variation into two components (sketch below):
– within group (across the repeated measures on a subject)
– across groups (between subjects)
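A minimal random-intercept sketch with the lme4 package (an assumption; nlme's lme is an equivalent route), reusing the hypothetical long data built above:

library(lme4)
# A random intercept per subject separates within-subject from between-subject variation
m <- lmer(score ~ visit + (1 | id), data = long)   # toy-sized data, for illustration only
summary(m)   # fixed effect of visit plus the two variance components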
Wrapping Up
• Basic model does not solve for the values of the random effects
• Empirical Bayes estimates can be used to predict the values of the random effects (see
text)
• Mixed effects models are applicable in a wide variety of data analysis situations
• Unlike Principal Components, Factor Analysis, and other methods we have discussed, they carry no special "cautions": they can be used whenever they are appropriate