BLOCK 1 Statistics Notes
2. Objects:
Objects are assigned values using the <- operator (although = also works)
There are four modes: numeric, character, logical, complex.
To check the mode of an object: mode(name_object)
Objects which consist of multiple values of the same mode are called vectors. Vectors are
usually created using the c() function. If c() is given values of different modes, the whole
vector is coerced to the most flexible of those modes (logical → numeric → complex →
character), so, for example, mixing numbers and text yields a character vector.
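A minimal sketch of modes and coercion (the object names are illustrative):
x <- c(1, 2.5, 3)       # numeric vector
mode(x)                 # "numeric"
y <- c(1, "a", TRUE)    # mixed modes: everything is coerced
mode(y)                 # "character", the most flexible mode wins
z <- c(TRUE, FALSE, 1)
mode(z)                 # "numeric", logicals are coerced to numbers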
6. User-defined functions
function.name <- function(arguments) {
computations on the arguments and some other code
}
For recursion in R, use Recall() in the definition of a function: Recall() always refers to the
function currently executing, so the recursion keeps working even if the function is later
renamed.
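A small sketch of a recursive user-defined function (an illustrative factorial, not from the notes):
fact <- function(n) {
  if (n <= 1) return(1)  # base case
  n * Recall(n - 1)      # Recall() calls the current function, even if renamed
}
fact(5)  # 120
g <- fact
g(5)     # still 120, even if fact is later removed or reassigned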
Native R files (.RData files) are stored on the computer and can be directly loaded in R later
using the command load:
load("bikes data.RData")
10. Factors
Factors in R refer to categorical variables.
When reading data, columns containing non-numerical values can be treated as factors
automatically by R (depending on the R version and the reading options; see the as.is note below).
Levels of a factor refer to the unique categories of the variable: levels(name_column).
The function factor converts a variable into a factor. If the data contain categorical variables
that have been coded using numbers, then it is essential to convert these to factors before
running any statistical analyses: gender.new <- factor(gender) or bikes.data$transport <-
factor(bikes.data$transport)
It is not essential to rename the levels to text rather than numbers, but it often helps to avoid
confusion: levels(gender.new) <- c("Male","Female")
Replace the first value in gender.new (Male) by Female: gender.new[1] <- "Female". A value can
only be replaced by a valid (existing) category/level of the factor; anything else becomes NA
with a warning.
Automatic conversion to factor can be forced by setting the argument as.is = FALSE in the
function read.csv().
Factor to numeric values: as.numeric(as.character(x)). If the factor contains any values which
cannot be converted to a number these will be set to be NA, and a warning will be printed.
Creating your own categorical variable from the data: daytime <- (hour > 6) & (hour < 20);
fdaytime <- factor(daytime)
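A short sketch pulling the factor operations above together (a gender variable coded as numbers is an assumed example):
gender <- c(1, 2, 2, 1, 1)                 # categories coded as numbers
gender.new <- factor(gender)               # convert to a factor
levels(gender.new)                         # "1" "2"
levels(gender.new) <- c("Male", "Female")  # rename the levels to text
gender.new[1] <- "Female"                  # valid: "Female" is an existing level
as.numeric(as.character(gender.new))       # all NA with a warning here, since
                                           # "Male"/"Female" are not numbers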
Basic Statistics:
The syntax for statistical modelling is generally very consistent, regardless of the type of
model/analysis being used: modellingfunction(myform, data = datasetname)
And the formula is generally of the form: myform <- responsevariable ~ explanatoryvariables
- T-test: a statistical test used to determine whether there is a difference between
the means of two groups. The test provides a T-value and a P-value. The T-value
measures the difference between the means of the two groups relative to the
variability of the data; a large absolute T-value suggests a real difference in means.
The null hypothesis asserts that there is no real difference between the groups
(any observed difference is due to chance); the alternative hypothesis asserts that
a real difference exists. The P-value is the probability of obtaining a difference at
least as large as the one observed if the null hypothesis were true. If the P-value is
low (less than 0.05), the observed difference is unlikely to be due to chance alone,
so we reject the null hypothesis in favour of the alternative (the smaller the
P-value, the more significant the result). If the P-value is high (more than 0.05), we
do not have sufficient evidence to reject the null hypothesis. – t.test(myform, data
= name_data). pt(): calculates the cumulative probability of a Student's t
distribution. The function receives as arguments the value x (point of interest), the
degrees of freedom df and optionally lower.tail, which indicates whether the
probability is calculated in the lower tail (TRUE by default). The cumulative
probability of a t distribution represents the probability that a random variable
following a Student's t takes a value less than or equal to x. (A runnable sketch of
these calls follows this list.)
- One-way ANOVA: used to compare the means of three or more groups and
determine whether at least one of them is significantly different from the others. –
aov(result ~ group, data = name_data); use summary() on the aov object to
summarise the result. The summary provides the F statistic and the P-value. The F
statistic is the ratio of the variability between groups to the variability within
groups; the higher the F-value, the more likely it is that at least one group has a
different mean. The P-value is the probability of finding a difference this large
between the groups if all the group means were equal. If the P-value is low, it
indicates that at least one of the groups has a mean different from the others; if
the P-value is high, there is not sufficient evidence to conclude that there are
significant differences between the groups. - summary(aov(wtgain ~ diet, data =
turkey)) or anova(lm(wtgain ~ diet, data = turkey)).
- Two-way ANOVA: an extension of one-way (single-factor) ANOVA; it evaluates the
effect of two categorical variables (factors), and their interaction, on a continuous
variable of interest. - aov(wtgain ~ diet + housing + diet:housing, data = turkey) or,
equivalently, aov(wtgain ~ diet*housing, data = turkey); summary(aov(wtgain ~
diet*housing, data = turkey))
- Linear regression: performs a linear regression analysis to evaluate the
relationship between two or more variables: a response/dependent variable (y)
and one (simple linear regression) or more (multiple linear regression)
explanatory/independent variables (x), which may be a mix of categorical and
numeric variables. Fit the linear regression model: lm(dependent ~ independent1
+ ...., data = name_data), then interpret the results: summary(model_created). A
low P-value for a coefficient indicates that the corresponding x variable is related
to y. R-squared indicates how much of the variability of the dependent variable is
explained by the model; the closer to 1 the better. The F-statistic evaluates
whether at least one of the independent variables has a significant effect on the
dependent variable. We can extract the ANOVA table for a linear regression using
anova: anova(name_model). We can produce graphical diagnostics, to assess how
well the model fits, using plot: plot(name_model): residuals vs fitted, normal Q-Q,
scale-location and residuals vs leverage.
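A worked sketch of the calls above, using the turkey data from the notes (the data set is assumed to be loaded; treating housing as a two-level factor for the t-test is an assumption):
# t-test: t.test() needs a grouping factor with exactly two levels
t.test(wtgain ~ housing, data = turkey)
# cumulative t probabilities with pt(), e.g. a two-sided p-value
# for an observed t = 2.1 with 18 degrees of freedom
2 * pt(-abs(2.1), df = 18)
# one-way ANOVA: does at least one diet differ?
m1 <- aov(wtgain ~ diet, data = turkey)
summary(m1)                          # F statistic and P-value
# two-way ANOVA with interaction
m2 <- aov(wtgain ~ diet * housing, data = turkey)
summary(m2)
# linear regression with a numeric and a categorical predictor
m3 <- lm(wtgain ~ initwt + diet, data = turkey)
summary(m3)                          # coefficients, R-squared, F-statistic
anova(m3)                            # ANOVA table for the regression
plot(m3)                             # the four diagnostic plots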
Square brackets [,] can also be used to extract rows and columns from a data frame, or ranges
of them using : .
x <- data.frame(my = c(1,2), yours = c("a", "b"))
Extract columns using their names:
names(name_data) # extract name of columns of the data
name_data$name_column
Also:
new <- turkey[turkey$initwt > 25,] # select the rows where initwt > 25
new <- turkey
new$wtgain[new$initwt > 25] <- NA # set wtgain to NA in the rows where initwt > 25
1. Introduction
- Data
- Data Structures
- Data Types
- Data pre-processing:
  - Parametric (model-based)
  - Non-parametric
Descriptive statistics help you understand and summarize your data, while inferential statistics
allow you to draw conclusions beyond your sample.
Frequentist paradigm:
Bayesian paradigm:
In short, the frequentist views parameters as fixed and uses probabilities to analyze repeated
collections of data. The Bayesian views parameters as random variables and uses prior
knowledge along with the data to update beliefs and obtain posterior distributions.
Exploratory data analysis: the process of visualization, summary and initial understanding of a
data set. It helps you gain a deep understanding of the data before performing more advanced
analysis or modeling.
Summary plots
- Histogram (hist(name_variable)): shows the distribution of the data; look for:
skewness, outliers, multimodality. INTERPRETATION: Shape of the Distribution:
observe the general shape of the histogram. It can be symmetrical, skewed to the
right (positively skewed), skewed to the left (negatively skewed), or have other
specific shapes. Modes (Peaks): the peaks in the histogram represent the "modes",
the values that occur most frequently. Dispersion and Kurtosis: dispersion is
related to the extent or width of the distribution; a wider histogram indicates
greater dispersion. Kurtosis describes the shape of the tails of the distribution
(whether they are pointed or flat). Outliers: observe whether there are values very
far from the rest of the data. -> Apply a logarithmic transformation when there is
exponential growth, to linearize the relationship, or to stabilize/reduce the
variance in right-skewed data and improve symmetry; apply a square root when
the data are non-negative, to stabilize/reduce variance in right-skewed data. (A
sketch of these plots follows this list.)
- Boxplot (boxplot(name_variable)): the box contains the middle 50% of the data.
The whiskers stretch out to where most of the data would be expected. Points
outside the whiskers are potential outliers.
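A minimal sketch of these summary plots and transformations (x is an assumed right-skewed variable, simulated here for illustration):
x <- rexp(200, rate = 1)  # simulated right-skewed data
hist(x)                   # right-skewed: long tail to the right
hist(log(x))              # log transform reduces the skew
hist(sqrt(x))             # square root: milder alternative for non-negative data
boxplot(x)                # box = middle 50%; points beyond whiskers = potential outliers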
pairs(): visualize relationships between multiple variables in a data set. Each cell in the matrix
generated by pairs() is a scatter plot showing the relationship between two variables. The main
diagonal identifies each individual variable (by default it shows the variable names; it can be
made to display histograms via a custom diag.panel).
INTERPRETATION:
Positive or Negative Relationship: If the points tend to form an ascending line or curve, it
indicates a positive relationship. If they tend to form a descending line or curve, it indicates a
negative relationship.
Dispersion: The dispersion of the points around the trend line indicates the strength of the
relationship between the two variables. A narrower dispersion indicates a stronger
relationship.
Outliers: you can identify outliers, values that deviate from the general pattern.
Shape of the Distribution: histograms of each variable (when shown on the main diagonal)
display its univariate distribution. You can see whether it follows a normal distribution or
shows some type of skew.
pairs(whr[c(4,5,6,8)], col = rain) # scatterplot matrix of the selected variables, with points
coloured by the levels of the factor rain
Subsetting data: extract a subset of rows or columns from a data set based on certain
conditions - subset(name_data, condition, select = c(columns), ...)
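For example, with the turkey data used earlier (column names as in the notes):
subset(turkey, initwt > 25, select = c(initwt, wtgain)) # rows with initwt > 25, two columns kept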
Probability (Prob): a number between 0 (impossible) and 1 (certain) measuring the uncertainty
of an event.
Full table of probabilities: addmargins(prop.table(table(name_column)))
Bell Shape: a peak in the center and tails extending to both sides.
Mean and Standard Deviation: The normal distribution is determined by its mean (μ) and its
standard deviation (σ). The mean determines the location of the peak and the standard
deviation controls how wide or narrow the bell is.
In practice: based on observed data, the probability is estimated as the proportion of values
that meet a specific condition (theoretically, pnorm() gives it directly under the normal model):
prop.table(table(x< 1.5))
FALSE TRUE
0.057 0.943
pnorm(1.5,mean=0,sd=1)
0.9332
Quantiles: for the standard normal distribution we might denote these by z_0.25, z_0.50 and
z_0.75 - z25 <- qnorm(.25, mean = 0, sd = 1)
Discrete distributions: applies to random variables that can take a finite or countably infinite
set of specific and distinct (discrete) values.
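As an illustrative sketch (not from the notes), the binomial distribution, the number of successes in a fixed number of trials, is a common discrete example:
dbinom(3, size = 10, prob = 0.5)  # P(X = 3) for X ~ Binomial(10, 0.5)
pbinom(3, size = 10, prob = 0.5)  # P(X <= 3), the cumulative probability
sum(dbinom(0:3, 10, 0.5))         # same value: sum the point probabilities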
Population: the total group of interest, the target of the data analysis we want to draw
conclusions about, not possible to measure everyone.
Sample: affordable subset of the population, selected by some sampling method (typically
some random procedure to avoid biases).
Sample unit: every single element (individual, object, ...) being studied through its variables
(data/values). The size of the sample is the number of sample units chosen.
We see the variability of the sample means is much smaller than the population one.
(sample mean ± 1.96·σ/√n) is called a 95% confidence interval for the population mean μ. In
general, for any parameter θ with a standard error SE(θ), an approximate confidence interval
will be:
estimate of θ ± 2·SE(θ)
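A minimal sketch of this interval for a sample mean (x is a simulated sample, for illustration):
x <- rnorm(100, mean = 10, sd = 2)  # simulated sample
se <- sd(x) / sqrt(length(x))       # standard error of the mean
mean(x) + c(-1.96, 1.96) * se       # approximate 95% confidence interval
t.test(x)$conf.int                  # R's own interval (uses the t distribution)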
Significance testing:
Hypothesis test: a way to statistically measure the support the data provide to a research idea
(hypothesis). It is used to make decisions about a statement or hypothesis about a population
parameter, based on the evidence provided by a data sample.
- Null Hypothesis (H0): it's easier to disprove rather than prove an idea, so we state
it in the negative and see if we can reject that statement
- Alternative Hypothesis (HA): the reverse of the null hypothesis.
- Test statistic: a quantity derived from the sample used to perform the test.
- P-value: the probability of obtaining a value of the test statistic at least as extreme
as the one observed if the null hypothesis were true. It is compared against a
cutoff value, usually called alpha and set at α = 0.05.
- Statistical significance: says that what we have found in the sample is likely to
happen in the population of interest
- With large sample sizes (e.g. 10,000 patients), we can be confident of statistically
detecting every tiny relationship or difference in our sample, but …
- “Real-world” significance: says that what we have found statistically is actually
meaningful, relevant in the context of application
Types of errors:
- Type I: rejecting the null hypothesis when it’s true (false positive)
- Type II: failing to reject the null hypothesis when it's false (false negative)
- Multiple testing problem: need to control the overall error rate when performing
multiple comparisons or statistical tests to avoid incorrect conclusions due to
significant results obtained by chance → Bonferroni, Benjamini and Hochberg’s
false discovery rate control, …
- Power: the probability of correctly rejecting the null hypothesis. Related to sample
size, effect size, significance level, … → power calculations to design experiments
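Power calculations for a t-test can be done with power.t.test() from base R; a sketch with assumed inputs:
# sample size needed per group to detect a difference of 0.5 SDs
# with 80% power at alpha = 0.05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
# conversely, the power achieved with n = 30 per group
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)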
The chi-square test: determines whether there is an association between two categorical variables.
How to use:
1. Hypothesis formulation:
- Null Hypothesis (H0): There is no association between the variables.
- Alternative Hypothesis (H1): There is an association between the variables.
2. Data collection: The data should be organized in a contingency table that shows the
observed frequencies for each combination of categories of the two variables.
3. Calculation of the Chi-Square Statistic: The chi-square statistic is calculated from the
observed and expected frequencies.
4. Degrees of freedom: The number of degrees of freedom depends on the size of the
table and is calculated as (r-1)(c-1), where "r" is the number of rows and "c" is the
number of columns in the table.
5. Comparison with the Critical Value or p-value: Using the chi-square distribution and
degrees of freedom, determine whether the observed chi-square value is statistically
significant. This is done by comparing the calculated value with a critical value from the
chi-square table or by calculating a p-value.
How to interpret:
If the calculated chi-square value is greater than the critical value in the table or if the p-value
is less than a previously established significance level (for example, 0.05), then the null
hypothesis is rejected and it is concluded that there is a significant association between the
variables.
If the calculated chi-square value is not large enough or the p-value is greater than the
significance level, there is not enough evidence to reject the null hypothesis and it is concluded
that there is no significant association between the variables.
NOTE: a contingency table is a tool in statistics used to organize and summarize categorical
data in the form of a matrix of two or more dimensions.
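A minimal sketch of the test in R (the contingency table here is invented for illustration):
# contingency table for two categorical variables (illustrative counts)
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))
tab
chisq.test(tab)  # chi-square statistic, df = (r-1)(c-1), p-value
# with raw data instead of a table: chisq.test(table(var1, var2))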