1st Unit Notes
Data Exploration:
Exploration, one of the first steps in data preparation, is a way to get to know data before
working with it.
Why Is Data Exploration Important?
Exploration allows for deeper understanding of a dataset, making it easier to navigate and use
the data later.
The better an analyst knows the data they’re working with, the better their analysis will be.
Successful exploration begins with an open mind, reveals new paths for discovery, and helps
to identify and refine future analytics questions and problems.
How Data Exploration Works
Data without a question is simply information. Asking a question of data turns it into an answer.
Data with the right questions and exploration can provide a deeper understanding of how things
work and even enable predictive abilities.
Data exploration typically follows three steps:
Understand the Variables: The basis for any data analysis begins with an understanding of
variables. A quick read of column names is a good place to start. A closer look at data
catalogues, field descriptions, and metadata can offer insight into what each field represents
and help discover missing or incomplete data.
Detect Any Outliers: Outliers or anomalies can derail an analysis and distort the reality of a
dataset, so it’s important to identify them early on.
Examine Patterns and Relationships: Plotting a dataset in a variety of ways makes it easier to
identify and examine the patterns and relationships among variables. For example, a business
exploring data from multiple stores may have information on location, population, temperature,
and per capita income. To estimate sales for a new location, they need to decide which variables
to include in their predictive model.
1.Introduction to single variable
It is the simplest form of data analysis, where the data being analyzed consist of only
one variable. Since there is only a single variable, it doesn't deal with causes or relationships. The
main purpose of this analysis is to describe the data and find patterns that exist within it.
1.1 Distributions and Variables
Two organizing concepts have become the basis of the language of data analysis: cases
and variables.
The cases are the basic units of analysis, the things about which information is
collected.
The word variable expresses the fact that this feature varies across the different cases.
Variables can be either categorical or numerical. Numerical variables can be
transformed into categorical ones by a process called binning. Transformation from categorical to
numerical is called encoding.
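The binning and encoding transformations just described can be sketched in plain Python (the notes don't prescribe a tool; the cut points and category codes below are illustrative assumptions):

```python
# Illustrative sketch: binning a numerical variable into categories,
# then encoding categories back to numbers. Cut points are assumed.

def bin_temperature(value):
    """Bin a numerical temperature into an ordinal category."""
    if value < 15:
        return "low"
    elif value < 25:
        return "medium"
    return "high"

def encode(labels):
    """Label-encode ordinal categories as integers."""
    codes = {"low": 0, "medium": 1, "high": 2}
    return [codes[label] for label in labels]

temps = [10, 18, 30]                                # numerical
categories = [bin_temperature(t) for t in temps]    # categorical
numeric = encode(categories)                        # numerical again
```

Here bin_temperature turns a continuous measurement into an ordinal (low, medium, high) variable, and encode performs the reverse mapping.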
A categorical or discrete variable is one that has two or more categories (values).
There are two types of categorical variable, nominal and ordinal.
A nominal variable has no intrinsic ordering to its categories.
For example, gender is a categorical variable having two categories (male or female)
with no intrinsic ordering to the categories.
An ordinal variable has a clear ordering. For example, temperature is a variable with three
ordered categories (low, medium, high).
A numerical or continuous variable is one that may take on any value within a finite or
infinite interval (e.g. height, weight, temperature).
There are two types of numerical variable, interval and ratio.
An interval variable has values whose differences are interpretable, but it does not have
a true zero. A good example is temperature in centigrade degrees.
Data on an interval scale can be added and subtracted but cannot be meaningfully
multiplied or divided.
A ratio variable has values with a true zero and can be added, subtracted, multiplied or
divided (e.g. weight).
1.2 Reducing the number of digits
The human brain is easily confused by an excess of detail. Numbers with many digits
are hard to read, and important features, such as their order of magnitude, may be obscured.
Some of the digits in a dataset vary, while others do not. In the following case: 134 121 167
there are two varying digits (the first is always 1).
In the following, there are also two varying digits: 0.034 0.045 0.062 whereas in the
following case: 0.67 1.31 0.92 there are three varying digits. If we wish to perform calculations
on the numbers, it is usually best to keep three varying digits until the end, and then display
only two.
There are two techniques for reducing the number of digits. The first is known as
rounding. Values from zero to four are rounded down, and six to nine are rounded up. The digit
five causes a problem; it can be rounded up after an odd digit and down after an even digit.
A second method of losing digits is simply cutting off or 'truncating' the ones that we do not
want. Thus, when cutting, all the numbers from 899.0 to 899.9 become 899. This procedure is
much quicker and does not run the extra risk of large mistakes.
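Both techniques can be checked in Python, whose built-in round() happens to follow the same 'five after odd goes up, five after even goes down' convention (round half to even):

```python
import math

# round() in Python uses round-half-to-even, matching the rule above:
assert round(1.5) == 2   # five after the odd digit 1: rounds up
assert round(2.5) == 2   # five after the even digit 2: rounds down

# Truncation simply cuts off the unwanted digits:
assert math.trunc(899.0) == 899
assert math.trunc(899.9) == 899
```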
1.3 Bar charts and pie chart
Blocks or lines of data are very hard to make sense of. We need an easier way of
visualizing how any variable is distributed across our cases. One simple device is the bar chart,
a visual display in which bars are drawn to represent each category of a variable such that the
length of the bar is proportional to the number of cases in the category.
A pie chart can also be used to display the same information. It is largely a matter of
taste whether data from a categorical variable are displayed in a bar chart or a pie chart. In
general, pie charts are to be preferred when there are only a few categories and when the sizes
of the categories are very different.
1.4 Histograms
Charts that are somewhat similar to bar charts can be used to display interval level
variables grouped into categories and these are called histograms. They are constructed in
exactly the same way as bar charts except, of course, that the ordering of the categories is fixed,
and care has to be taken to show exactly how the data were grouped.
Features visible in histograms
Histograms allow inspection of four important aspects of any distribution: its level, spread,
shape and the presence of outliers.
Distributions with one peak are called unimodal, and those with two peaks are called bimodal.
From interval level to ordinal level variables – recoding
Recoding variables in this way can be particularly useful if the aim is to present results from
surveys in simple tables.
1.5 Using SPSS to produce bar charts, pie charts and histograms
• SPSS (the Statistical Package for the Social Sciences)
• It is also frequently used by researchers in market research companies, local
authorities, health authorities and government departments.
Getting started with SPSS
SPSS is a very useful computer package which includes hundreds of different procedures for
displaying and analysing data.
Rather than trying to discover and understand all the facilities that SPSS provides, it is
better to start by focusing on mastering just a few procedures.
SPSS has three main windows:
• The Data Editor window
• The Output window
• The Syntax window
When you first open SPSS, the Data Editor window will be displayed. This will be empty
until you either open an existing data file or type in your own data - in the same way that you
would enter data into a spreadsheet like Excel.
When you use SPSS to produce a graph, a table or some statistical analysis, your results
will appear in an Output Viewer window. All the contents of the Output Viewer can be saved
into an Output file so that you can come back to them later.
SPSS syntax consists of keywords and commands that need to be entered very precisely
and in the correct order.
2. Numerical Summaries of Level and Spread
In this section we will focus on the topic of working hours to demonstrate how simple
descriptive statistics can be used to provide numerical summaries of level and spread.
2.1 Working hours of men and women
The histograms of the working hours distributions of men and women are shown in
figures below. We can compare these two distributions in terms of the four features introduced
in the previous chapter, namely level, spread, shape and outliers. We can then see that:
• The male batch is at a higher level than the female batch
• The two distributions are somewhat similarly spread out
• The female batch is bimodal, suggesting there are two rather different underlying populations
• The male batch is unimodal
Fig: Weekly hours worked by men    Fig: Weekly hours worked by women
Although there are no extremely high values, those working for over 90 hours per week
stand out as being rather different from the bulk of the population.
These verbal descriptions of the differences between the male and female working
hours' distributions are rather vague.
Perhaps the features could be summarized numerically, so that we could give a typical
numerical value for male hours and female hours, a single number summary for how spread
out the two distributions were.
A summary always involves some loss of information; summaries cannot be expected to
contain the richness of information that existed in the original picture.
However, they do have important advantages.
▪ They focus the attention of the data analyst on one thing at a time, and prevent
the eye from wandering aimlessly over a display of the data.
▪ They also help focus the process of comparison from one dataset to another, and
make it more rigorous.
2.2 Summaries of level
The level expresses where on the scale of numbers found in the dataset the distribution
is concentrated.
In the previous example (figure), it expresses where on a scale running from 1 hour per
week to 100 hours per week the distribution's centre point lies.
To summarize these values, one number must be found to express the typical hours
worked by men, for example.
The problem is: how do we define 'typical'?
There are many possible answers.
• The value half-way between the extremes might be chosen, or the single most
common number of hours worked, or a summary of the middle portion of the
distribution.
With a little imagination we could produce many candidates. Therefore it is important to agree
on what basis the choice should be made.
Residuals
Before introducing some possible alternative summaries of level, it is helpful to
introduce the idea of a 'residual'.
A residual can be defined as the difference between a data point and the observed
typical, or average, value. For example if we had chosen 40 hours a week as the typical level
of men's working hours, then a man who was recorded in the survey as working 45 hours a
week would have a residual of 5 hours. Another way of expressing this is to say that the residual
is the observed data value minus the predicted value and in this case 45-40 = 5.
In this example the process of calculating residuals is a way of recasting the hours
worked by each man in terms of distances from typical male working hours in the sample. Any
data value such as a measurement of hours worked or income earned can be thought of as being
composed of two components: a fitted part and a residual part. This can be expressed as an
equation:
Data = Fit + Residual
The median
The value of the case at the middle of an ordered distribution would seem to have an
intuitive claim to typicality. Finding such a number is easy when there are very few cases.
In the example of hours worked by a small random sample of 15 men (figure ), the
value of 48 hours per week fits the bill. There are six men who work fewer hours and seven
men who work more hours while two men work exactly 48 hours per week. Similarly, in the
female data, the value of the middle case is 37 hours. The data value that meets this criterion
is called the median: the value of the case that has equal numbers of data points above and
below it.
Fig: Men's working hours ranked to show median    Fig: Women's working hours ranked to show median
The median is easy to find when, as here, there are an odd number of data points.
When the number of data points is even, it is an interval, not one case, which splits the
distribution into two.
The value of the median is conventionally taken to be half-way between the two middle cases.
Thus the median in a dataset with fifty data points would be half-way between the values of
the 25th and 26th data points.
Put formally, with N data points, the median M is the value at depth (N + 1)/2. It is not the
value at depth N/2. With twenty data points, for example, the tenth case has nine points which
lie below it and ten above.
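The depth rule above can be turned into a short Python sketch (illustrative, not from the notes):

```python
def median(values):
    """Median: the value at depth (N + 1) / 2 of the ordered batch."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2
    if depth == int(depth):            # odd N: a single middle case
        return ordered[int(depth) - 1]
    lower = ordered[int(depth) - 1]    # even N: half-way between
    upper = ordered[int(depth)]        # the two middle cases
    return (lower + upper) / 2

assert median([50, 35, 48]) == 48      # odd number of cases
assert median([1, 2, 3, 4]) == 2.5     # even: half-way between 2 and 3
```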
Why choose the median as the typical value?
It is a point at which the sum of absolute residuals from that point is at a minimum. (An
absolute value denotes the magnitude of a number, regardless of its sign.)
In other words, if we fit the median, calculate the residuals for every data point and then
add them up ignoring the sign, the answer we get will be the same or smaller than it would
have been if we had picked the value of any other point.
In short, the median defines 'typical' in a particular way: making the size of the residuals
as small as possible.
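This minimization property can be verified numerically; a small sketch with made-up hours data:

```python
def sum_abs_residuals(data, fit):
    """Sum of absolute residuals from a candidate fitted value."""
    return sum(abs(y - fit) for y in data)

hours = [35, 40, 45, 48, 60]   # illustrative batch; its median is 45

# No candidate fit beats the median on the sum of absolute residuals:
for candidate in range(30, 65):
    assert sum_abs_residuals(hours, 45) <= sum_abs_residuals(hours, candidate)
```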
The arithmetic mean
Another commonly used measure of the centre of a distribution is the arithmetic mean. Indeed,
it is so commonly used that it has even become known as the average. It is conventionally
written as Ȳ (pronounced 'Y bar'). To calculate it, first all of the values are summed, and then
the total is divided by the number of data points. In more mathematical terms:

Ȳ = ∑ Yᵢ / N

We have come across N before. The symbol Y is conventionally used to refer to an actual
variable. The subscript i is an index to tell us which case is being referred to. So, in this case,
Yᵢ refers to all the values of the hours variable. The Greek letter ∑, pronounced 'sigma', is the
mathematician's way of saying 'the sum of'.
The deviations from the mean are squared, summed and divided by the sample size (well, N −
1 actually, for technical reasons), and then the square root is taken to return to the original units:

s = √( ∑ (Yᵢ − Ȳ)² / (N − 1) )

The order in which the calculations are performed is very important. As always, calculations
within brackets are performed first, then multiplication and division, then addition (including
summation) and subtraction. Without the square root, the measure is called the variance, s².
The layout for a worksheet to calculate the standard deviation of the hours worked by this small
sample of men is shown in figure
The original data values are written in the first column, and the sum and mean calculated at the
bottom. The residuals are calculated and displayed in column 2, and their squared values are
placed in column 3. The sum of these squared
values is shown at the foot of column 3, and from it the standard deviation is calculated.
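The worksheet procedure translates directly into code; a sketch with an illustrative batch of hours:

```python
def standard_deviation(values):
    """Worksheet-style standard deviation with the N - 1 divisor."""
    n = len(values)
    mean = sum(values) / n                   # mean at the foot of column 1
    residuals = [y - mean for y in values]   # column 2
    squared = [r ** 2 for r in residuals]    # column 3
    variance = sum(squared) / (n - 1)        # the variance, s squared
    return variance ** 0.5                   # square root: back to hours

hours = [40, 45, 48, 50, 37]     # illustrative data, mean 44
sd = standard_deviation(hours)   # about 5.43
```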
Next select the variable for which you want to calculate the median and midspread, mean and
standard deviation. For example 'workhrs' has been selected below (in figure ). It is also worth
unchecking the box labelled 'Display frequency tables' for a variable such as this one, which
takes many different values, because with a large dataset a very large table is produced that is
unwieldy in the SPSS output.
Then click on the 'Statistics' button and select the descriptive statistics that you want SPSS to
calculate
You will notice that the mean and median are grouped together under measures of 'Central
Tendency'. This is just another term for measures of 'level'. In this example we have specified
that SPSS should provide the quartiles as well as the deciles (cut points for 10 equal groups).
Once you have chosen the statistics you want to calculate, click on the 'Continue' button and
then click on OK in the next dialogue box. Alternatively, the syntax for this command is shown
in the box.
This syntax is automatically written to a new syntax file for you if you click on the 'Paste'
button rather than the 'OK' button in the frequencies dialogue box. To 'Run' the syntax (i.e. to
get the computer to obey your commands) you simply highlight the syntax with the cursor and
click on the arrow button circled in the illustration below.
The output produced by these commands is shown below. This provides the summary statistics
for hours worked per week for men and women combined (note that there are now 12,519 valid
cases (6,392 men plus 6,127 women). We can see for example that the interquartile range is
12.5 and that in other words 50 per cent of the sample work for between 27.5 and 40 hours per
week.
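The quartiles and interquartile range that SPSS reports can be approximated by hand. One common depth-based convention is sketched below with made-up hours; statistical packages differ slightly in how they interpolate quartiles:

```python
def quartiles(values):
    """Quartiles at depths (N + 1) / 4 and 3(N + 1) / 4 of the ordered
    batch, interpolating between cases when the depth is fractional.
    (One common convention; packages differ slightly.)"""
    ordered = sorted(values)
    n = len(ordered)

    def at_depth(depth):
        low = int(depth) - 1
        frac = depth - int(depth)
        if frac == 0:
            return ordered[low]
        return ordered[low] + frac * (ordered[low + 1] - ordered[low])

    return at_depth((n + 1) / 4), at_depth(3 * (n + 1) / 4)

q1, q3 = quartiles([27, 30, 35, 38, 40, 42, 45])  # illustrative hours
iqr = q3 - q1   # 50 per cent of the batch lies inside this range
```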
In order to produce these summary statistics separately for men and women, as shown in figures
2.5 and 2.6 the same frequencies commands can be run in SPSS but first it is necessary
temporarily to select men and then temporarily to select women. Once again this can either be
done using the menus or using syntax. The correct syntax is displayed in the box.
To use the menus to select a subset of cases, first choose 'Select cases' from the 'Data' menu
Then select the 'If condition is satisfied' option as shown below and click on the 'If ... ' button.
Use the next dialogue box to indicate that you want to select only the men to analyse by
specifying Sex = 1. Finally click on 'Continue' and then on the 'OK' button.
Fig: Specifying a condition for selected cases to satisfy.
You have now used the menus to specify that all subsequent analyses should be performed on
men only, i.e. those cases for whom the variable 'Sex' has been coded as '1'. If you subsequently
want to carry out analyses on women only it is necessary to follow the same process but to
substitute 'Sex = 2' for 'Sex = 1' in the final dialogue box. Alternatively if you want to return to
an analysis of all cases then use the 'Select cases' option on the 'Data' menu to select the 'All
cases' option as shown in figure
Exercises
2.1 Pick any three numbers and calculate their mean and median. Calculate the residuals and
squared residuals from each, and sum them. Confirm that the median produces smaller
absolute residuals and the mean produces smaller squared residuals. (You are not expected
to prove that this is bound to be the case.)
A variable which has been standardized in this way is forced to have a mean or median of 0
and a standard deviation or midspread of 1.
In other words this individual has a reading score which is eight-tenths (or four-fifths)
of a standard deviation above the mean. The same individual's mathematics score becomes (17
− 12.75)/7, or 0.61. This first respondent is therefore above average in both reading and maths.
To summarize, we can add these two together and arrive at a score of 1.41 for attainment in
general.
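The standardization arithmetic can be sketched in Python; the worked figures from the notes check out:

```python
def standardize(values):
    """z-scores: (y - mean) / standard deviation (N - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((y - mean) ** 2 for y in values) / (n - 1)) ** 0.5
    return [(y - mean) / sd for y in values]

# Reproducing the maths score from the notes: (17 - 12.75) / 7
maths_z = (17 - 12.75) / 7
assert round(maths_z, 2) == 0.61

# Standardized variables always sum (and so average) to zero:
zs = standardize([9, 17, 12])   # illustrative raw scores
assert abs(sum(zs)) < 1e-9
```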
It is very straightforward to create standardized variables using SPSS. By using the
Descriptives command, the SPSS package will automatically save a standardized version of
any variable. First select the menus
Analyze > Descriptive Statistics > Descriptives
The next stage is to select the variables that you wish to standardize, in this case N2928
and N2930, and check the box next to 'Save standardized values as variables.' The SPSS
package will then automatically save new standardized variables with the suffix Z. In this
example, two new variables ZN2928 and ZN2930 are created.
Households are first organised according to their respective household incomes, from the lowest
to the highest. A line is drawn through the points where the cumulative proportion of
households is plotted against the cumulative proportion of income. This is the Lorenz curve.
A Lorenz curve is a graphical representation of the distribution of income or wealth within
a population.
If every household has the same income, the Lorenz curve lies along the 45-degree line. This is
known as the perfect equality line, where household income is equally distributed across all
households.
If one household has all the income and the others have zero income, the Lorenz curve is flat
along the horizontal axis before it rises to 100%, where all the income is earned by one household.
This is known as the perfect inequality line.
The Gini coefficient varies between the values of 0 and 1. The Gini coefficient is zero
when the Lorenz curve coincides with the perfect equality line, which reflects perfect equality. As
the Lorenz curve moves away from the perfect equality line, the Gini coefficient
increases, reflecting a more uneven spread of household incomes and consequently greater
income inequality.
The Gini coefficient is one when the Lorenz curve coincides with the perfect inequality
line, which reflects perfect inequality.
The Gini coefficient is used as a measure of income inequality.
Example:
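A sketch of the Lorenz-curve construction and the Gini coefficient, using the trapezium rule to approximate the area under the curve (illustrative incomes; note that for a finite sample of N households, 'perfect inequality' yields (N − 1)/N rather than exactly 1):

```python
def gini(incomes):
    """Gini coefficient from the Lorenz curve: households are ordered
    from lowest to highest income, cumulative income shares traced out,
    and the area under the curve approximated by the trapezium rule."""
    ordered = sorted(incomes)
    n = len(ordered)
    total = sum(ordered)
    cumulative_share = 0.0
    previous_share = 0.0
    area = 0.0
    for income in ordered:
        cumulative_share += income / total
        area += (previous_share + cumulative_share) / 2 / n
        previous_share = cumulative_share
    return 1 - 2 * area

assert gini([10, 10, 10, 10]) == 0.0    # perfect equality
assert gini([0, 0, 0, 100]) == 0.75     # (N - 1) / N with N = 4
```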
5. Smoothing Time Series
Smoothing is usually done to help us better see patterns in time series, trends for example.
Generally, we smooth out the irregular roughness to see a clearer signal. For seasonal data,
we might smooth out the seasonality so that we can identify the trend. Smoothing doesn’t
provide us with a model, but it can be a good first step in describing various components of the
series.
The term filter is sometimes used to describe a smoothing procedure. For instance, if
the smoothed value for a particular time is calculated as a linear combination of observations
for surrounding times, it might be said that we’ve applied a linear filter to the data (not the
same as saying the result is a straight line, by the way).
Some terminology used here
• Time-series: It’s a sequence of data points taken at successive equally spaced points in
time.
• Level: The average value in the series.
• Trend: The increasing or decreasing value in the series.
• Seasonality: The repeating short-term cycle in the series.
The 'Apply shorter time window at start and end' parameter is used to control the time window
at the start and end of the time series. If a shorter window is not applied, smoothed values will
be null for any record where the time window extends before the start or after the end of the
time series. If the time window is shortened, the time window will truncate at the start and end
and smooth using the values within the window. For example, if you have daily data and use a
backward moving average with a two-day time window, the smoothed values of the first two
days will be null if the time window is not shortened (note that the second day is only one day
after the start of the time series). On the third day (two days after the start of the time series),
the two day time window will not extend before the start, so the smoothed value of the third
day will be the average of the values of the first three days.
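The backward moving average behaviour just described can be sketched as follows (the parameter name belongs to the software the notes reference; the function below is an illustrative stand-in):

```python
def backward_moving_average(values, window, shorten=True):
    """Backward moving average: each smoothed value averages the
    current observation with up to `window` preceding ones.
    With shorten=False, records whose window would extend before
    the start of the series are left as None (null)."""
    smoothed = []
    for i in range(len(values)):
        start = i - window
        if start < 0:
            if not shorten:
                smoothed.append(None)   # window extends before the start
                continue
            start = 0                   # shorter window: truncate at start
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

daily = [10, 20, 30, 40]   # illustrative daily data
# Two-day window, no shortening: the first two smoothed values are null,
# and the third day averages the first three days, as described above.
assert backward_moving_average(daily, 2, shorten=False) == [None, None, 20.0, 30.0]
```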