Descriptive Data Analysis Using PSPP and EpiData Analysis


DESCRIPTIVE DATA ANALYSIS USING PSPP AND EPIDATA ANALYSIS
Load cancer.sav into PSPP as follows:
1. Pull down the File menu and choose Open.
2. In the dialog box that follows, go to the desktop, where you should have saved
cancer.sav, and click on the file.
3. Click on Open.

The cytology dataset


The variables in the data set are:
labo_id: The identification number of the laboratory where the sample of cells from the
woman's cervix was screened to see if the woman had cancerous lesions.
preparaa: The serial number given to the slide prepared from the cervical cells.
bloodpus: Whether the sample of cervical cells had too much blood or inflamed cells.
Codes:
0 = Missing value
1 = Too much blood or pus
2 = Not obscured by blood or pus
cytolse: Whether there was excessive cytolysis or not.
Codes:
0 = Missing value
1 = Excessive cytolysis
2 = No cytolysis
sil: Nature of the squamous epithelium according to the Bethesda classification system.
Codes:
0 = Missing value
1 = Normal
2 = ASCUS
3 = L-SIL
4 = H-SIL
5 = Suggestive for invasive cancer
bact: Whether the sample of cells had bacterial overgrowth or not.
Codes:
0 = Missing value
1 = Yes
2 = No
hpv: Whether the human papillomavirus (HPV) was present in the cervical cells.
Codes:
0 = Missing value
1 = Yes
2 = No
s: Nature of the squamous lesions as inferred from the cervical cells.
Codes:
0 = Missing value
1 = Normal
2 = ASCUS
3 = L-SIL
4 = H-SIL or cancer
univ: The university where the samples of cervical cells were sent for examination.
Codes:
1 = University of Brussels (VUB)
2 = University of Leuven (KUL)
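
The code lists above can be kept as a small lookup table when the data are handled outside PSPP. The following is only a sketch in Python: the names CODEBOOK and decode are hypothetical, and only a few of the variables are included. Any code without a label, such as the missing-value code 0, is returned as None.

```python
# Hypothetical codebook built from some of the code lists above.
# Code 0 (missing value) is deliberately left out of each mapping.
CODEBOOK = {
    "bloodpus": {1: "Too much blood or pus", 2: "Not obscured by blood or pus"},
    "sil": {
        1: "Normal",
        2: "ASCUS",
        3: "L-SIL",
        4: "H-SIL",
        5: "Suggestive for invasive cancer",
    },
    "hpv": {1: "Yes", 2: "No"},
    "univ": {1: "University of Brussels (VUB)", 2: "University of Leuven (KUL)"},
}


def decode(variable, code):
    """Return the label for a code, or None when the code has no label
    (for example 0, the missing-value code)."""
    return CODEBOOK[variable].get(code)


print(decode("sil", 3))
print(decode("hpv", 0))
```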

UNIVARIATE DATA ANALYSIS USING PSPP AND EPIDATA ANALYSIS


Categorical Data:
To produce a frequency distribution of the diagnosis from the slide prepared from the cervical
cells, find the frequency distribution of the variable s.
1. Pull down the Analyze menu of PSPP to Descriptive Statistics and then to
Frequencies.
2. Click on the variable squamous lesions (s) in the list on the left.
3. Click on the arrow between the list of variables on the left and the Variables box on
the right. The variable s should appear in the Variables box on the right.
4. Click on OK. The frequency distribution for s will appear in a separate Output
window.
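
For readers who want to see what the Frequencies procedure computes, here is a minimal sketch in Python. The function name frequency_table and the sample values are made up for illustration and are not taken from cancer.sav. Code 0 is treated as missing and left out of the percentages.

```python
from collections import Counter


def frequency_table(codes, missing=0):
    """Return {code: (count, valid_percent)}, ignoring the missing-value code."""
    valid = [c for c in codes if c != missing]
    total = len(valid)
    if total == 0:
        return {}
    counts = Counter(valid)
    return {code: (n, 100.0 * n / total) for code, n in sorted(counts.items())}


# Made-up values of s, for illustration only.
sample_s = [1, 1, 1, 2, 3, 3, 4, 0, 1, 2]
for code, (n, pct) in frequency_table(sample_s).items():
    print(f"code {code}: n = {n}, valid percent = {pct:.1f}")
```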
QUESTIONS:
1. The variable s is measured on an ordinal scale, with L-SIL being less serious than H-SIL.
The presence of L-SIL or H-SIL shows that the patient has abnormal cervical lesions. Which
category had the highest percentage of abnormal lesions?
2. Repeat the process for the variables bact, hpv, univ and bloodpus.
How many slides had HPV?
How many slides came from the University of Brussels (VUB)? From the University of
Leuven (KUL)?
How many slides were of poor quality (had too much blood or pus)?

DRAWING A PIE CHART


To draw a pie chart, proceed as follows:
1. Pull down the Analyze menu of PSPP to Descriptive Statistics and then to
Frequencies.
2. Click on the variable squamous lesions (s) in the list on the left.
3. Click on the arrow between the list of variables on the left and the Variables box on
the right. The variable s should appear in the Variables box on the right.
4. In the Statistics box, deselect everything. Make sure no statistic is ticked.
5. Click on Charts.
6. Under Pie charts, tick Draw pie chart by clicking in the box to the left of it.
7. Click on Continue.
8. Click on Frequency Tables.
9. Under the heading Display Frequency Tables, choose Never by clicking on the radio
button to the left of it.
10. Click on Continue.
11. Click on OK.
The pie chart for s will appear in a separate Output window.

SUMMARISING CONTINUOUS DATA


Load Bangwe data. The data are in the file called bangwe.sav on your desktop.
To produce summary statistics for weight (wt) and age (approxag):
1. Pull down the Analyze menu of PSPP to Descriptive Statistics and then to
Descriptives.
2. Click on the variables weight (wt) and age (approxag) in the list on the left.
3. Click on the arrow in the middle of the screen, between the list of variables on the
left and the Variables and Statistics boxes on the right. The variables weight (wt)
and age (approxag) will appear in the Variables box on the right.
4. In the Statistics box, select the statistics you want computed from your data.
5. Click on OK. The results will appear in the Output window.
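
The statistics offered in the Descriptives dialog can also be computed by hand. The sketch below uses Python's standard statistics module. The function name describe and the weights are made up for illustration and are not taken from bangwe.sav.

```python
import statistics


def describe(values):
    """Return a few common summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "minimum": min(values),
        "maximum": max(values),
    }


# Made-up weights in kg, for illustration only.
weights_kg = [48.0, 52.5, 45.0, 50.0, 47.5]
for name, value in describe(weights_kg).items():
    print(f"{name}: {value}")
```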
DRAWING A HISTOGRAM
To produce histograms for weight (wt) and age (approxag):
1. Pull down the Analyze menu of PSPP to Descriptive Statistics and then to
Frequencies.
2. Click on the variables weight (wt) and age (approxag) in the list on the left.
3. Click on the arrow in the middle of the screen, between the list of variables on the
left and the Variables and Statistics boxes on the right. The variables weight (wt)
and age (approxag) will appear in the Variables box on the right.
4. Click on Charts.
5. Under Histograms, select Draw histograms by clicking in the box to the left.
6. Click on Continue.
7. Click on Frequency Tables.
8. Under Display Frequency Tables, choose Never by clicking on the radio button to the
left of Never.
9. Click on Continue.
10. Click on OK. The histograms will appear in the Output window.
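
A histogram is simply a count of how many values fall into each interval of equal width. The sketch below shows that counting step in Python. The function name histogram_counts, the bin width and the ages are assumptions chosen for illustration, and PSPP chooses its own bins.

```python
def histogram_counts(values, bin_width, start):
    """Count values per bin, where each bin is [lower, lower + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        lower = start + index * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Made-up ages in years, for illustration only.
ages = [23, 27, 31, 35, 38, 41, 44, 52]
for lower, n in histogram_counts(ages, bin_width=10, start=20).items():
    print(f"{lower} to under {lower + 10}: {'*' * n}")
```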

BIVARIATE ANALYSIS
TWO CATEGORICAL VARIABLES:
Reload the cancer data into PSPP.
Produce a contingency table of hpv cross-classified by s.
Proceed as follows:
1. Pull down the Analyze menu of PSPP to Descriptive Statistics and then to Crosstabs.
2. Click on the variable hpv on the left.
3. Click on the arrow to the left of the Columns box. This action should send the
variable hpv into the Columns box.
4. Click on the variable squamous lesions (s) on the left.
5. Click on the arrow to the left of the Rows box. This action should send the variable
squamous lesions into the Rows box.
6. Click on OK. The contingency table of hpv cross-classified by s will appear in a
separate Output window.
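
To make the contents of the Crosstabs output less of a black box, the sketch below builds a contingency table from two lists of codes and computes the Pearson Chi-squared statistic with its degrees of freedom. It is a sketch only: the function names crosstab and chi_squared are hypothetical, the p-value is not computed, and every row and column total is assumed to be greater than zero.

```python
from collections import Counter


def crosstab(rows, cols):
    """Return (row_levels, col_levels, table), where table[i][j] is a count."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def chi_squared(table):
    """Pearson Chi-squared statistic and degrees of freedom for a table of counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Made-up codes for s (rows) and hpv (columns), for illustration only.
s_codes = [1, 1, 2, 2, 2]
hpv_codes = [1, 2, 1, 1, 2]
print(crosstab(s_codes, hpv_codes))
print(chi_squared([[10, 20], [20, 10]]))
```

The statistic is compared with the Chi-squared distribution on the stated degrees of freedom. PSPP reports the resulting significance value below the table.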
QUESTIONS
1. Is the presence of HPV significantly associated with the grade of squamous lesion? Use
the results of the Chi-squared test below the contingency table in the Output window.
2. Draw contingency tables of:
a. s and bloodpus
b. s and cytolyse
c. s and bact
d. s and hpv
e. s and univ
3. From the results below each table in a through e above, is the grade of lesion (s)
associated with:
Presence of blood/pus?
Excessive cytolysis?
Presence of bacteria?
Presence of HPV?
University where the examination was conducted?
4. Produce a contingency table of grade of lesion (s) and univ.
Which university had a higher percentage of ASCUS slides?

TWO CONTINUOUS VARIABLES


Reload Bangwe data. We want to investigate the relationship between age and weight.
Start EpiData Analysis in one of two ways:
1. Click on the EpiData Analysis icon on the desktop or in the Start menu, or
2. Start EpiData Entry and pull down the Tools menu to EpiData Analysis.
Load Bangwe data. We will load the file named bangwe.rec.

DRAWING A SCATTERPLOT
In EpiData Analysis:
1. Click on Graph, then Scatter.
2. The X variable is age (approxag). Pull down the arrow to the left of the X variable box
and choose approxag.
3. The Y variable is weight (wt). Pull down the arrow to the left of the Y variable box and
choose wt.
4. Click on Run. The scatterplot will appear in the output window.

COMPUTING THE CORRELATION COEFFICIENT


1. Click on Analysis, then pull down the menu to choose Correlate.
2. Choose Age (approxag) by clicking the box on its left.
3. Choose Weight (wt) by clicking the box on its left.
4. Click on Run to complete the analysis.
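
The correlation coefficient reported for two continuous variables is normally Pearson's r. The sketch below computes it in Python. That EpiData Analysis reports Pearson's r is an assumption here, the function name pearson_r is hypothetical, and the numbers are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up ages (years) and weights (kg), for illustration only.
ages = [20, 30, 40, 50, 60]
weights = [52.0, 49.5, 51.0, 47.0, 48.5]
print(round(pearson_r(ages, weights), 3))
```

Values of r close to +1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.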

A CONTINUOUS VARIABLE AND A CATEGORICAL VARIABLE


We want to find out the relationship between sex and weight. We will load Bangwe data.
To determine whether there is a significant relationship between the two variables we will use
the t-test. The underlying assumption of the t-test is that the continuous variable is
normally distributed.
First, let us draw a histogram of weight and see if it looks normal.
We will proceed as follows:
1. Click on Analysis, then Histogram.
2. Choose weight (wt) as the X variable by pulling down the arrow opposite the X variable
box and choosing wt.
3. Click on Run.
The correct way of finding out if the normality assumption holds is to draw the histogram of
weight for each sex separately. The distribution of weight is symmetric so we proceed to
conduct the t-test. We next compare the confidence intervals of weight for each sex. If they
overlap, there is no clear evidence that weight is associated with sex.
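
The comparison described above can be sketched in Python as follows. Two simplifications are assumed: the confidence intervals use the normal critical value 1.96 rather than the exact t value, and the test statistic is Welch's version of the two-sample t statistic, which may differ from the one EpiData Analysis reports. The function names and the weights are made up for illustration.

```python
import math
import statistics


def mean_ci(values, z=1.96):
    """Approximate 95% confidence interval for a mean (normal approximation)."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m - z * se, m + z * se


def welch_t(a, b):
    """Welch's t statistic for the difference between two group means."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)


# Made-up weights in kg for each sex, for illustration only.
men = [50.0, 52.0, 54.0]
women = [44.0, 46.0, 48.0]
print("Men:", mean_ci(men))
print("Women:", mean_ci(women))
print("t =", round(welch_t(men, women), 3))
```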

DRAWING A MEANS PLOT USING EXCEL.


The mean weight of men is 50.9 kg. The mean weight of women is 45.7 kg.
In Excel, construct the following table:
Sex      Weight
Men      50.9
Women    45.7

To draw the means plot:
1. Highlight the data in the table.
2. Pull down the Insert menu and click on Line.
3. Under Line choose the line graph with connected points.
4. Excel will draw a Means Plot for you as shown below.
MEANS PLOT.
[Figure: line graph of mean Weight for Men and Women; the vertical axis runs from 43 to 52.]

REGRESSION DATA ANALYSIS USING PSPP AND EPIDATA ANALYSIS
Load Bangwe data. We want to fit a model of weight and age.
We will regress weight on age.
Weight will be the response (dependent) variable and age will be the independent variable.
To fit a regression model, proceed as follows:
1. Click on Analysis. Select Regress.
2. Select weight (wt) first.
3. Select age (approxag).
4. Click on Run.
Note that the second table from the regression analysis shows the estimates of β0 and β1,
the intercept and the coefficient for age in the model weight = β0 + β1 × age.
The last column contains p-values.
If the p-value for the intercept is less than the level of significance, it means that the intercept
should be retained in the model. Similarly, if the coefficient for age has a p-value which is less than
the level of significance, then age significantly influences weight.
Study the first table from the regression output.
The column labelled Number of obs contains a row named R-squared, and the corresponding value of
R-squared is in the column labelled 520. If you multiply the value of R-squared by 100, you get the
percentage of variation in the response variable which is explained by including age in the model.
In this case, 100 × R² = 2%. This implies that 2% of the variation in weight is explained by
including age in the regression model. R² is called the coefficient of determination. Big values of
R² imply that the regression model is good and can give reliable estimates if used in predicting the
response, which is weight in our case. By big values, we mean values close to 100%.

Note that each value in the column labelled t is the value in the column labelled BETA divided by
the value in the column labelled SE. Read the handout on simple regression.
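
The quantities discussed above can be reproduced with a short least-squares calculation. The sketch below is in Python. The function name simple_regression and the data are made up for illustration, and the results are not those of the Bangwe data. It returns the intercept, the slope, R-squared, and the t value for the slope, computed as the estimate divided by its standard error.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x, with R-squared and t for the slope.

    Assumes at least three observations and a fit that is not perfect
    (otherwise the standard error of the slope is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - mean_y) ** 2 for b in y)
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "r_squared": 1 - sse / sst,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,  # estimate divided by its standard error
    }


# Made-up data, for illustration only.
result = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```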
