Chapter Two: Data Classification, Collection, Tabulation, and Presentation


By: Asimamaw B. (MSc.)


2.1. Introduction
- Whenever a set of data that we have collected contains a large number of observations, the best
way to examine such data is to present it in some compact and orderly form.
- Such a need arises because data contained in a questionnaire are in a form which does not give
any idea about the salient features of the problem under study. Such data are not directly
suitable for analysis and interpretation.
- For this reason the data set is organized and summarized in such a way that patterns are
revealed and are more easily interpreted.
- Such an arrangement of data is known as the distribution of the data.
- Distribution is important because it reveals the pattern of variation and helps in a better
understanding of the phenomenon the data present.
2.2. Classification of Data

 Classification of data is the process of arranging data in groups/classes on the basis of certain
properties. The classification of statistical data serves the following purposes:
- It condenses the raw data into a form suitable for statistical analysis.
- It removes complexities and highlights the features of the data.
- It facilitates comparisons and the drawing of inferences from the data. For example, if
university students in a particular course are divided according to sex, their results can be
compared.
- It provides information about the mutual relationships among elements of a data set. For
example, based on the literacy and criminal tendency of a group of people, it can be
established whether or not literacy has any impact on criminal tendency.
- It helps in statistical analysis by separating elements of the data set into homogeneous groups.
 Requisites of Ideal Classification
The classification of data is decided after taking into consideration the nature, scope, and purpose
of the investigation. However, an ideal classification should have the following characteristics:
 It should be unambiguous: it is necessary that the various classes should be so defined
that there is no room for confusion. There must be only one class for each element of
the data set. For example, if the population of the country is divided into two classes,
say literates and illiterates, then an exhaustive definition of the terms used would be
essential.
 Classes should be exhaustive and mutually exclusive: Each element of the data set
must belong to a class. For this, an extra class can be created with the title ‘others’ so
as to accommodate all the remaining elements of the data set. Each class should be
mutually exclusive so that each element must belong to only one class. For example,
classification of students according to the age: below 25 years and more than 20 years,
is not correct because students of age 20 to 25 may belong to both the classes.
 It should be stable: The classification of a data set into various classes must be done
in such a manner that if each time an investigation is conducted, it remains unchanged
and hence the results of one investigation may be compared with that of another. For
example, classification of the country’s population by a census survey based on
occupation suffers from this defect because various occupations are defined in different
ways in successive censuses and, as such, these figures are not strictly comparable.
 It should be flexible: A classification should be flexible so that suitable adjustments
can be made in new situations and circumstances. However, flexibility does not mean
instability. The data should be divided into a few major classes which must be further
subdivided. Ordinarily there would not be many changes in the major classes. Only
small sub-classes may need a change and the classification can thus retain the merit of
stability and yet have flexibility.
- The term stability does not mean rigidity of classes. The term is used in a relative
sense. A one-time classification cannot remain stable forever. With change in time,
some classes become obsolete and have to be dropped and fresh classes have to be
added. The classification may be called ideal if it can adjust itself to these changes
and yet retain its stability.
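The requirement that classes be exhaustive and mutually exclusive can be checked mechanically. The following is a minimal Python sketch and is not part of the original notes: the helper names `classes_containing` and `is_valid_classification`, the sample ages, and the repaired class scheme are all hypothetical and only mirror the age example above.

```python
# Hypothetical helper: list every class whose rule accepts a given value.
def classes_containing(value, classes):
    return [name for name, rule in classes.items() if rule(value)]

# Faulty scheme from the text: "below 25 years" and "more than 20 years" overlap.
faulty = {
    "below 25": lambda age: age < 25,
    "more than 20": lambda age: age > 20,
}

# One possible repair (an assumption, not from the text): a single cut-off.
repaired = {
    "below 25": lambda age: age < 25,
    "25 and above": lambda age: age >= 25,
}

def is_valid_classification(values, classes):
    # Valid here means: every value falls into exactly one class.
    return all(len(classes_containing(v, classes)) == 1 for v in values)

ages = [18, 20, 22, 25, 30]  # invented sample
print(classes_containing(22, faulty))           # age 22 satisfies both rules
print(is_valid_classification(ages, faulty))
print(is_valid_classification(ages, repaired))
```

A value that matches no class signals a scheme that is not exhaustive; a value that matches more than one class signals classes that are not mutually exclusive.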
 Basis of Classification
Data are generally classified on one of the following four bases:
 Geographical Classification: In geographical classification, data are classified on the basis
of geographical or locational differences (such as cities, districts, or villages) between
various elements of the data set.
 Chronological Classification: When data are classified on the basis of time, the
classification is known as chronological classification. Such classifications are also called
time series because data are usually listed in chronological order starting with the earliest
period.
 Qualitative Classification: In qualitative classification, data are classified on the basis of
descriptive characteristics or on the basis of attributes like sex, literacy, region, caste, or
education, which cannot be quantified. This is done in two ways: (i) Simple classification:
In this type of classification, each class is subdivided into two sub-classes and only one
attribute is studied such as: male and female; blind and not blind, educated and uneducated,
and so on. (ii) Manifold classification: In this type of classification, a class is subdivided
into more than two sub-classes which may be sub-divided further. An example of this form
of classification is shown in the box:
 Quantitative Classification: In this classification, data are classified on the basis of some
characteristics which can be measured such as height, weight, income, expenditure,
production, or sales.
- Quantitative variables can be divided into the following two types. The term
variable refers to any quantity or attribute whose value varies from one
investigation to another.
 Continuous variable is the one that can take any value within the range of
numbers. Thus the height or weight of individuals can be of any value within
the limits. In such a case data are obtained by measurement.
 Discrete (also called discontinuous) variable is the one whose values change
by steps or jumps and cannot assume a fractional value. The number of
children in a family, number of workers (or employees), and number of students
in a class are a few examples of a discrete variable. In such a case data are
obtained by counting.
Table 1: Examples of continuous and discrete variables in a data set

Discrete Series                            Continuous Series
Number of Children   Number of Families    Weight (kg)   Number of Persons
0                    10                    100 to 110    10
1                    30                    110 to 120    20
2                    60                    120 to 130    25
3                    90                    130 to 140    35
4                    110                   140 to 150    50
5                    20
Total                320                   Total         140
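The difference between a discrete series (obtained by counting) and a continuous series (obtained by measurement) can be sketched in Python as follows. The raw data are invented for illustration, the function name `class_label` is hypothetical, and treating a class such as "100 to 110" as the half-open interval from 100 up to but not including 110 is an assumption made here so that a boundary value is counted only once.

```python
from collections import Counter

# Hypothetical raw data, invented for illustration only.
children_per_family = [0, 2, 1, 3, 2, 4, 2, 3, 1, 0]
weights_kg = [101.5, 110.0, 118.2, 123.4, 129.9, 135.0, 141.7, 149.3]

# Discrete series: each distinct count is its own class.
discrete_series = dict(sorted(Counter(children_per_family).items()))

# Continuous series: each measurement is assigned to a class of width 10,
# treated as a half-open interval [lower, lower + 10).
def class_label(weight, start=100, width=10):
    lower = start + int((weight - start) // width) * width
    return f"{lower} to {lower + width}"

continuous_series = dict(Counter(class_label(w) for w in weights_kg))

print(discrete_series)
print(continuous_series)
```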

2.3. Methods of Data Collection


 Data collection – It is the process of obtaining measurements, counts or any other things by
experimentation or observation.
- It is the first stage in statistics/statistical analysis.
- The methods of collecting primary and secondary data differ since primary data are to
be originally collected, while in case of secondary data the nature of data collection
work is merely that of compilation.
 Methods of Collection of Primary Data
- Among the several methods of collection of primary data, the following methods
are mostly used:
 Observation method: in observational studies, the investigator does not ask questions to
seek clarifications on certain issues. Instead he records the behavior, as it occurs, of an
event in which he is interested. Sometimes mechanical devices are also used to record the
desired data.
 Interview method: the interview method of collecting data involves presentation of
oral-verbal stimuli and reply in terms of oral-verbal responses. This method can be used through
personal interviews and, if possible, through telephone interviews.
 Questionnaire method: one of the most conventional methods of data collection,
particularly in wider areas having big inquiries, is the questionnaire method of primary data
collection.
- In this method, a questionnaire is prepared befitting the objective of the study
and sent, generally by post, to the respondents with a request to answer the
questionnaire.
- A questionnaire consists of a number of questions printed or typed in a definite
order on a form or set of forms.
- The questionnaire is mailed to respondents who are expected to read and
understand the questions and write down the reply in the space meant for the
purpose in the questionnaire itself.
- The respondents have to answer the questions on their own.
 Schedule method: a schedule is the tool or instrument used to collect data from the respondents
while an interview is conducted.
- A schedule contains questions, statements (on which opinions are elicited) and blank
spaces/tables to be filled in with the respondents' answers.
- Schedule is the name usually applied to a set of questions which are asked and filled
in by an interviewer in a face to face situation with another person.
- This method of data collection is very much like the collection of data through
questionnaires, with the small difference that schedules (proformas
containing a set of questions) are filled in by enumerators who are
specially appointed for the purpose.
- These enumerators along with schedules, go to respondents, put to them the
questions from the proforma in the order the questions are listed and record the
replies in the space meant for the same in the proforma.
 Other methods such as:
- Warranty cards: Warranty cards are usually postal sized cards which are used by
dealers of consumer durables to collect information regarding their products.
- Distributor or store audits: Distributor or store audits are performed by
distributors as well as manufacturers through their salesmen at regular intervals.
- Pantry audits: Pantry audit technique is used to estimate consumption of the basket
of goods at the consumer level. In this type of audit, the investigator collects an
inventory of types, quantities, and prices of commodities consumed.
 Methods of collection of Secondary data
- Secondary data may either be published data or unpublished data.
 Usually published data are available in: (a) various publications of the central, state, or
local governments; (b) various publications of foreign governments or of international
bodies and their subsidiary organizations; (c) technical and trade journals; (d) books,
magazines and newspapers; (e) reports and publications of various associations connected
with business and industry, banks, stock exchanges, etc.; (f) reports prepared by research
scholars, universities, economists, etc. in different fields; and (g) public records and
statistics, historical documents, and other sources of published information.
 The sources of unpublished data are many; they may be found in diaries, letters,
unpublished biographies and autobiographies and also may be available with scholars and
research workers, trade associations, labor bureaus and other public/ private individuals
and organizations.

2.4. Organizing Data


 The best way to examine a large set of numerical data is first to organize and present it in an
appropriate format, such as a frequency distribution, a table, or a graph.
- To describe situations, draw conclusions, or make inferences about events, the data must
be organized in some meaningful way.
- After organizing the data, data must be presented so they can be understood by those who
will benefit from reading the study.
- The most useful method of presenting the data is by constructing statistical charts and
graphs.
- There are many different types of charts and graphs, and each one has a specific purpose.
- By so organizing the data, we can better identify trends, patterns, and other characteristics
that would not be apparent during a simple shuffle through a pile of questionnaires or
other data collection forms.
- Such summarization also helps us compare data that have been collected at different
points in time, by different researchers, or from different sources.
- It can be very difficult to reach conclusions unless we simplify the mass of numbers
contained in the original data.
2.4.1. The Frequency Distribution
 The frequency distribution is a table that divides the data values into classes and shows the
number of observed values that fall into each class.
- By converting data to a frequency distribution, we gain a perspective that helps us see
the forest instead of the individual trees.
- In the creation of a frequency distribution, scores are usually grouped into class
intervals, or ranges of numbers.
o Example: Here are 50 scores on a test of statistics for Business on which a frequency
distribution is based:
Table 2: data of 50 scores on a test of statistics for Business

And here’s the frequency distribution. You can see that for each range of scores, there are
associated frequency counts.
Table 3: frequency distribution of 50 scores on a test of statistics for Business
- Class interval is a range of numbers, and the first step in the creation of a frequency
distribution is to define how large each interval will be.
 Simply put, there are no hard-and-fast rules about creating class intervals on the
way to creating a frequency distribution. Here are six general rules:
 Decide on the number of class intervals.
o The following two rules are often used to decide approximate number of classes in a frequency
distribution:
I. If k represents the number of classes and N the total number of
observations, then the value of k will be the smallest exponent of the
number 2, so that $2^k \geq N$.
o Example: we have N = 30 observations. If we apply this rule, then we shall have $2^3 = 8$ (< 30);
$2^4 = 16$ (< 30); $2^5 = 32$ (> 30). Thus we may choose k = 5 as the number of classes.
II. According to Sturges' rule, the number of classes can be determined by
the formula
$k = 1 + 3.322 \log_{10} N$
where k is the number of classes and $\log_{10} N$ is the common (base-10) logarithm of the total
number of observations.
Applying this rule to N = 30, we get
$k = 1 + 3.322 \log_{10} 30 = 1 + 3.322\,(1.4771) = 5.907 \approx 6$
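Both rules for the approximate number of classes can be sketched in a few lines of Python. The function names are hypothetical, and Sturges' rule is taken here in its standard form k = 1 + log2 N (approximately 1 + 3.322 log10 N). The rules are only guides, so the two results need not agree.

```python
import math

def classes_power_of_two(n):
    # Rule I: smallest integer k such that 2**k >= n.
    k = 0
    while 2 ** k < n:
        k += 1
    return k

def classes_sturges(n):
    # Rule II (Sturges): k = 1 + log2(n), approximately 1 + 3.322 * log10(n).
    # The result is usually rounded to a whole number of classes.
    return 1 + math.log2(n)

n = 30
print(classes_power_of_two(n))        # 2**4 = 16 < 30 <= 2**5 = 32
print(round(classes_sturges(n), 3))
```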

 Decide on the width of the class interval.


o Width of each class interval should be equal in size. The size (or width) of each class interval
can be determined by first taking the difference between the largest and smallest numerical
values in the data set and then dividing it by the number of class intervals desired.
$h = \dfrac{\text{Largest numerical value} - \text{Smallest numerical value}}{\text{Number of classes desired}}$
where h is the width of the class interval.
 Determine class limits (Boundaries).
o The limits of each class interval should be clearly defined so that each observation (element)
of the data set belongs to one and only one class. Each class has two limits: a lower limit and
an upper limit.
 Create the class intervals.
 Put the data into the class intervals.
 Once class intervals are created, it’s time to complete the frequency part of the
frequency distribution. That’s simply counting the number of times a score occurs
in the raw data and entering that number in each of the class intervals represented
by the count.
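The steps above (number of classes, class width, class limits, counting) can be put together in a short Python sketch. It is one possible way to do it, under stated assumptions: the function name `frequency_distribution` is hypothetical, the width is rounded up to a whole number for tidy limits, classes include their lower limit but not their upper limit (the last class includes both), and the data are assumed not to be all equal. The scores are invented and are not the 50 scores of Table 2.

```python
import math

def frequency_distribution(data, k):
    """Group data into k equal-width classes and count the values in each."""
    low, high = min(data), max(data)
    # Width h = (largest - smallest) / k, rounded up (an assumption made here).
    h = math.ceil((high - low) / k)
    table = []
    for i in range(k):
        lower = low + i * h
        upper = lower + h
        if i == k - 1:
            # Last class is closed on both sides so the largest value is kept.
            count = sum(1 for x in data if lower <= x <= upper)
        else:
            # Other classes are half-open, which keeps them mutually exclusive.
            count = sum(1 for x in data if lower <= x < upper)
        table.append((lower, upper, count))
    return table

# Hypothetical scores, invented for illustration.
scores = [47, 52, 55, 58, 61, 63, 64, 66, 70, 71, 73, 75, 78, 82, 85, 88, 91, 95]
for lower, upper, count in frequency_distribution(scores, k=5):
    print(f"{lower}-{upper}: {count}")
```

Because every score falls into exactly one class, the class frequencies add up to the number of observations, which is a quick check on any frequency distribution.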
 Cumulating Frequencies
- Once you have created a frequency distribution and have visually represented those data
using a histogram or a frequency polygon, another option is to create a visual representation
of the cumulative frequency of occurrences by class intervals. This is called a cumulative
frequency distribution.
- A cumulative frequency distribution is based on the same data as a frequency distribution
but with an added column (Cumulative Frequency), as shown below.

Table 4: Cumulative frequency for 50 scores on a test of statistics for Business

Table 5: The frequency distribution of the number of hours of overtime

Number of Overtime Hours   Tally     Number of Weeks (Frequency)
84                         ||        2
85                         ||        2
86                         -         0
87                         |         1
88                         ||||      4
89                         |||       3
90                         ||        2
91                         ||        2
92                         ||        2
93                         ||||||    6
94                         |||||     5
95                         |         1
Total                                30
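As an illustration of cumulating frequencies, the following Python sketch adds a cumulative frequency column to the overtime data of Table 5. The frequencies are copied from the table; the variable names are arbitrary. Each cumulative value is the number of weeks with that many overtime hours or fewer, so the last value must equal the total number of weeks.

```python
from itertools import accumulate

# Frequencies copied from Table 5 (overtime hours -> number of weeks).
overtime = {84: 2, 85: 2, 86: 0, 87: 1, 88: 4, 89: 3,
            90: 2, 91: 2, 92: 2, 93: 6, 94: 5, 95: 1}

hours = sorted(overtime)
frequencies = [overtime[h] for h in hours]
cumulative = list(accumulate(frequencies))   # running totals of the frequencies

print("Hours  Freq  Cum. freq")
for h, f, cf in zip(hours, frequencies, cumulative):
    print(f"{h:>5} {f:>5} {cf:>10}")
```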
2.4.2. Tabulation of Data


 Tabulation is another way of summarizing and presenting the given data in a systematic form
in rows and columns. Such presentation facilitates comparison by bringing related information
close to each other and helps in further statistical analysis and interpretation.
- Tabulation is the logical listing of related quantitative data in vertical columns and
horizontal rows of numbers with sufficient explanatory and qualifying words, phrases
and statements in the form of titles, headings, and explanatory notes to make clear the
full meaning, context, and the origin of the data.
- Tables are a means of recording in permanent form the analysis that is made through
classification and of placing in juxtaposition things that are similar and should be
compared.
- The major objectives of tabulation are:
 To simplify the complex data: Tabulation presents the data set in a systematic and
concise form avoiding unnecessary details. The idea is to reduce the bulk of
information (data) under investigation into a simplified and meaningful form.
 To economize space: By condensing data in a meaningful form, space is saved
without sacrificing the quality and quantity of data.
 To depict trend: Data condensed in the form of a table reveal the trend or pattern
of data which otherwise cannot be understood in a descriptive form of presentation.
 To facilitate comparison: Data presented in a tabular form, having rows and
columns, facilitate quick comparison among its observations.
 To facilitate statistical analysis: Tabulation is a phase between classification
of data and its presentation. Various statistical techniques such as measures of
average and dispersion, correlation and regression, time series, and so on can be
applied to analyze the data and then interpret the results.
 To help reference: When data are arranged in tables in a suitable form, they can
easily be identified and can also be used as reference for future needs.
- Parts of a Table
 Table number: A table should be numbered for easy identification and future
reference.
 Title of the table: Each table must have a brief, self-explanatory, and complete title.
 Caption and stubs: The heading for columns and rows are called caption and stub,
respectively. They must be clear and concise.
 Body: The body of the table contains the numerical information. The
numerical information is arranged according to the descriptions given for each
column and row.
 Prefatory or head note: If need be, a prefatory note is given just below the title
for its further description in a prominent type. It is usually enclosed in brackets and
states the unit of measurement.
 Foot notes: Anything written below the table is called a footnote. It is written to
further clarify the title, captions, or stubs.
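The parts of a table listed above can be made concrete with a small Python sketch that assembles them into a plain-text table. The function `render_table` and all figures passed to it are hypothetical and serve only to show where the table number, title, head note, captions, stubs, body, and footnote go.

```python
def render_table(number, title, head_note, captions, rows, footnote):
    """Return a plain-text table; rows maps each stub to its body values."""
    stub_width = max(len(stub) for stub in rows)
    col_width = max(len(c) for c in captions) + 2
    # Table number and title, followed by the head note in brackets.
    lines = [f"Table {number}: {title}", f"({head_note})"]
    # Captions: column headings, offset by the width of the stub column.
    lines.append(" " * stub_width + "".join(c.rjust(col_width) for c in captions))
    # Body: one line per stub (row heading) followed by its values.
    for stub, values in rows.items():
        cells = "".join(str(v).rjust(col_width) for v in values)
        lines.append(stub.ljust(stub_width) + cells)
    # Footnote below the table.
    lines.append(f"Note: {footnote}")
    return "\n".join(lines)

# All figures below are invented for illustration.
print(render_table(
    number=1,
    title="Students by department and sex",
    head_note="number of students",
    captions=["Male", "Female"],
    rows={"Economics": [40, 35], "Management": [30, 45]},
    footnote="hypothetical data",
))
```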
o Example: the educational differences between households that participated in off-farm activities
and those that did not are presented in the following table.

Table 6: Educational difference in participant and non-participant households in off-farm activities

                                            Participant (P)          Non-participant (NP)
Variable                                    Mean  Std. Err.  Obs.    Mean  Std. Err.  Obs.    Diff = (NP-P)
Ever attendance (1 = attended)              0.67  0.01       3555    0.63  0.00       9533    -0.04*
Enrollment status (1 = currently enrolled)  0.91  0.01       2397    0.93  0.00       5977    0.02*
Basic literacy skill (1 = read and write)   0.47  0.01       3555    0.45  0.01       9534    -0.02***
Absence (1 = absent)                        0.10  0.01       2171    0.11  0.00       5567    0.01
Highest grade completed                     1.42  0.03       3594    1.36  0.02       9614    -0.08***
Grade attainment relative to age            0.47  0.04       3587    0.58  0.02       9577    0.11*
Mean schooling of male                      1.84  0.03       3536    1.59  0.02       9494    -0.25*
Mean schooling of female                    1.59  0.03       3566    1.41  0.02       9536    -0.18*
Head's schooling                            2.10  0.05       3569    1.64  0.03       9555    -0.45*
Note: * and *** denote statistical significance at the 1% and 10% levels, respectively.
Source: Authors’ computation based on 2011/12, 2013/14, and 2015/16 ESS data.

2.4.3. Graphical Presentation of Data


 One of the important functions of statistics is to present complex and unorganized (raw) data
in such a manner that they would easily be understandable. According to King, ‘One of the
chief aims of statistical science is to render the meaning of masses of figures clear and
comprehensible at a glance.’ This is often best accomplished by presenting the data in a
pictorial (or graphical) form.
- Graphic presentation of frequency distributions facilitates easy understanding of data
presentation and interpretation.
- The shape of the graph gives an exact idea of the variations of the distribution trends.
Graphic presentation, therefore, serves as an easy technique for quick and effective
comparison between two or more frequency distributions.
- When the graph of one frequency distribution is superimposed on the other, the points
of contrast regarding the type of distribution and the pattern of variation become quite
obvious.
- All these advantages necessitate a clear understanding of the various forms of graphic
representation of a frequency distribution.
o Example: The trend of inflation in Ethiopia is shown in the following graph.

Figure 1: Trend of inflation in Ethiopia
[Line graph titled "Inflation trend in Ethiopia": the series INF is plotted against the years 1991 to 2018 on the horizontal axis; the vertical axis runs from -20 to 60.]
Source: National Bank of Ethiopia (2020)

2.4.4. Diagrammatic Presentation of Data


 Diagrammatic form of representation is more convincing and appealing than the other forms
of data presentation.
- This form of presentation is easily understood by any person, a layman as well as an
educated person.
- Different diagrammatic forms of presentation are (a) line diagram, (b) bar diagram, (c)
histogram, (d) frequency polygon, (e) cumulative frequency curve or Ogive, (f) pie
charts, (g) pictorial diagrams, (h) maps, etc.; within each type, there may be variant
types.
 General Rules for Drawing Diagrams
 To draw useful inferences from graphical presentation of data, it is important to understand
how they are prepared and how they should be interpreted.
 Although it is said that ‘one picture is worth a thousand words’, a diagram neither proves nor disproves
a particular fact, nor is it suitable for further analysis of data.
 However, if diagrams are properly drawn, they highlight the different characteristics of
data.
 The following general guidelines are taken into consideration while preparing diagrams:
- Title: Each diagram should have a suitable title. It may be given either at the top of
the diagram or below it.
- Size: The size and portion of each component of a diagram should be such that all
the relevant characteristics of the data are properly displayed and can be easily
understood.
- Proportion of length and breadth: An appropriate proportion between the length
and breadth of the diagram should be maintained.
- Proper scale: There are again no fixed rules for selection of scale. The diagram
should neither be too small nor too large. The scale for the diagram should be
decided after taking into consideration the magnitude of data and the size of the
paper on which it is to be drawn. The scale showing the values as far as possible,
should be in even numbers or in multiples of 5, 10, 20, and so on. The scale should
specify the size of the unit and the nature of data it represents, for example,
‘millions of tonnes’, in Rs thousand, and the like. The scale adopted should be
indicated on both vertical and horizontal axes if different scales are used. Otherwise,
it can be indicated at some suitable place on the graph paper.
- Footnotes and source note: To clarify or elucidate any points which need further
explanation but cannot be shown in the graph, footnotes are given at the bottom of
the diagrams.
- Index: A brief index explaining the different types of lines, shades, designs, or
colours used in the construction of the diagram should be given to understand its
contents.
- Simplicity: Diagrams should be prepared in such a way that they can be understood
easily. To keep it simple, too much information should not be loaded in a single
diagram as it may create confusion.
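The guideline on proper scale can be illustrated with a small Python sketch that picks a round step between axis values. The function name `axis_step`, the target of about five divisions, and the use of steps of the form 1, 2, 5, or 10 times a power of ten are assumptions made here; they are one common convention and not a rule stated in the text.

```python
import math

def axis_step(max_value, target_ticks=5):
    """Pick a round step (1, 2, 5 or 10 times a power of ten) for an axis."""
    raw = max_value / target_ticks              # ideal but usually awkward step
    power = 10 ** math.floor(math.log10(raw))   # power of ten just below it
    for multiple in (1, 2, 5, 10):
        if multiple * power >= raw:
            return multiple * power

print(axis_step(60))    # largest value 60
print(axis_step(130))   # largest value 130
```

The scale still has to be checked against the size of the paper and labelled with the unit it represents.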
 Types of Diagrams
A. Line diagram: A frequency line for discrete as well as for continuous distributions can be represented
graphically by drawing ordinates equal to the frequency on a convenient scale at different
values of the variable, X. For the example of yield, we shall have different yield classes on
the horizontal X-axis and frequencies on the vertical Y-axis as shown in Fig. 2.
B. Bar diagram: Instead of drawing a line joining the class frequencies, one represents the
frequencies in the form of bars. In bar diagrams, equal bases on a horizontal (or vertical) line
are selected, and rectangles are constructed with length proportional to the given frequencies
on a suitably chosen scale. The bars should be drawn at equal distances from one another (Fig.
3).
Fig. 2 Line diagram

Fig. 3: Bar graph

C. Histogram: A histogram is similar to a bar diagram for discrete data; the only
difference is that the absence of any gap between two consecutive classes is
reflected by leaving no gap between two consecutive bars. Continuous grouped data are
usually represented graphically by a histogram. The rectangles are drawn with bases
corresponding to the true class intervals and with heights proportional to the frequencies. With
all the class intervals equal, the areas of the rectangles also represent the corresponding
frequencies. If the class intervals are not all equal, then the heights are to be suitably adjusted
to make the areas proportional to the frequencies (Fig. 4).

Fig. 4: Yield frequency histogram of 130 varieties of paddy
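The height adjustment for unequal class intervals in a histogram can be sketched in Python as follows. The classes are hypothetical (they are not the paddy yield data of Fig. 4), and the function name `bar_heights` is arbitrary. Dividing each frequency by its class width gives a height whose rectangle has an area equal to the frequency.

```python
# Hypothetical classes with unequal widths: (lower limit, upper limit, frequency).
classes = [(0, 10, 8), (10, 20, 12), (20, 40, 16), (40, 80, 8)]

def bar_heights(classes):
    # Height = frequency / class width, so that area (height x width) = frequency.
    return [freq / (upper - lower) for lower, upper, freq in classes]

heights = bar_heights(classes)
for (lower, upper, freq), height in zip(classes, heights):
    print(f"{lower}-{upper}: frequency {freq}, height {height}")
```

Note that the classes 0-10 and 20-40 get the same height even though the second has twice the frequency, because it is also twice as wide.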


D. Frequency Polygon: If the midpoints of the top of the bars in histogram are joined by straight
lines, then a frequency polygon is obtained.
Fig. 5: Yield frequency polygon and histogram of 130 varieties of paddy
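The points of a frequency polygon can be computed directly from a frequency distribution, as in the following Python sketch. The classes are hypothetical and the function name `polygon_points` is arbitrary. Closing the polygon on the X-axis, at the midpoints of an empty class on either side, is a common convention assumed here and not something the text prescribes.

```python
# Hypothetical equal-width classes: (lower limit, upper limit, frequency).
classes = [(10, 20, 4), (20, 30, 9), (30, 40, 12), (40, 50, 5)]

def polygon_points(classes):
    width = classes[0][1] - classes[0][0]
    # One point per class: (class midpoint, frequency).
    points = [((lower + upper) / 2, freq) for lower, upper, freq in classes]
    # Assumed convention: add zero-frequency points one class width
    # before the first midpoint and after the last midpoint.
    first_mid, last_mid = points[0][0], points[-1][0]
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]

print(polygon_points(classes))
```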

E. Pie Chart: The basic idea behind the formation of a pie diagram is to take the total
frequency as 100% and present it in a circle with a 360° angle at the center. In the frequency
distribution table, ordinary frequency or relative frequency can effectively be used in the form
of a pie diagram. Thus, for example, a pie chart of the yield data is prepared from the class
frequencies (Fig. 6).

Fig. 6: Pie diagram of yield frequency for 130 varieties of paddy
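The angle of each slice of a pie chart follows from the share of its class in the total frequency. A minimal Python sketch, with hypothetical class labels and frequencies (not the yield data of Fig. 6) and an arbitrary function name `pie_angles`:

```python
def pie_angles(frequencies):
    # Each class gets a share of the 360 degrees equal to its share of the total.
    total = sum(frequencies.values())
    return {label: 360 * freq / total for label, freq in frequencies.items()}

# Hypothetical class frequencies, invented for illustration.
frequencies = {"A": 10, "B": 20, "C": 30, "D": 60}
angles = pie_angles(frequencies)
print(angles)
```

The angles always add up to 360 degrees, whether ordinary or relative frequencies are used as input.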

F. Cumulative Frequency Curve (Ogive): Partitioning the whole data set can very well be made
with the help of a cumulative frequency graph, also known as an ogive.
G. Pictorial Diagram: To make the information lively and easy to understand by any user,
sometimes information is presented in pictorial forms. Instead of a bar diagram or line
diagram or pie chart, one can use pictures in the diagrams.
H. Maps: Statistical maps are generally used to represent the distribution of particular parameters
like a forest area in a country, paddy-producing zone, and different mines located at different
places in a country, rainfall pattern, population density, etc.
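To illustrate how a cumulative frequency curve (ogive) partitions a data set, the following Python sketch estimates the value below which a given fraction of the observations lie, for example the median at one half. It reads the value off a "less than" ogive by linear interpolation within the class where the cumulative frequency first reaches the target. The function name `grouped_quantile` and the grouped data are hypothetical, and linear interpolation assumes the observations are spread evenly within each class.

```python
from itertools import accumulate

def grouped_quantile(classes, p):
    """Estimate the value below which a fraction p of the observations lie."""
    freqs = [f for _, _, f in classes]
    cumulative = list(accumulate(freqs))
    target = p * cumulative[-1]          # cumulative frequency to be reached
    previous = 0                         # cumulative frequency before the class
    for (lower, upper, freq), cum in zip(classes, cumulative):
        if freq > 0 and cum >= target:
            # Move along the class in proportion to how far the target lies
            # between the cumulative frequencies at its two limits.
            return lower + (target - previous) / freq * (upper - lower)
        previous = cum

# Hypothetical grouped data: (lower limit, upper limit, frequency).
classes = [(10, 20, 4), (20, 30, 9), (30, 40, 12), (40, 50, 5)]
print(round(grouped_quantile(classes, 0.5), 2))   # estimated median
```

With p = 0.25 and p = 0.75 the same function gives estimates of the lower and upper quartiles.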
“End of Chapter Two”