Classification of Data
REPRESENTATION OF DATA
Objective:
After studying this unit, the reader should be able to:
Classify data
Know about the stem and leaf diagram
Draw the stem and leaf diagram
Know about frequency distribution
Construct a frequency distribution
Know relative and cumulative frequencies
Construct relative and cumulative frequency distributions
Know different graphs
Draw the graphs for the given data
Structure
7.1 Introduction
7.2 Classification of data
7.3 Construction of frequency table
7.4 Relative frequency distribution
7.5 Cumulative frequency distribution
7.6 Stem and leaf diagram
7.7 Diagrammatic representation of data
7.8 Key words
7.9 Suggested readings
7.10 Review exercise
7.1 INTRODUCTION
Statistics is a science of aggregates: it deals with groups of numbers, that is, with sets of
data. Data as first collected are called raw data; they are simply a collection of facts and
figures from a population or a sample. These data are then arranged, displayed, summarized,
compiled and analyzed. Compilation of data can be tabular or graphical.
In tabular compilation we present the data in the form of a table. It may be a simple
classification according to some class intervals. Graphs and charts are used to display the data
diagrammatically.
The collected data are classified using tally marks. Tally marks are small vertical lines used
as symbols to represent a count. The raw data are classified in a frequency table.
Classification gives a summary of the data. It facilitates comparison, if any, between the
attributes of the variable under consideration. Classification highlights the characteristics of
the data.
The basis for classification may be area, time, quality or quantity.
Area-wise classification separates the data according to geographical areas. An example is the
sales of different brands of colour televisions in different parts of the country. Data of this
type may give information about the pattern of customer choices in different parts of the country.
Accordingly, a colour television company can develop its advertising strategy for different parts
of the country and for different advertising media.
Time-wise classification arranges the data chronologically. Time series data are data classified
with respect to time. The basis of a time-based classification may be a day, a month, a year, a
decade etc. The sensitive index of a stock market has a day as its base. The turnover of a company
may have a year as its base.
Qualitative classification is according to some attribute or characteristic that cannot be
measured. The classification may be dichotomous, i.e. into two categories, having and not having
the attribute, or it may have several classes. The students in an MBA class may be classified as
having graduated in Science, Commerce, Arts, Engineering etc.
Quantitative classification is used when the characteristic under consideration is measurable.
Examples are the salary of employees, the length of iron rods manufactured by a machine, the
quantity of a soft drink dispensed by a machine etc.
Consider the data giving the details about the number of employees working in small scale
industries in a particular MIDC area.
21 22 26 24 28 21 23 25 26 25 25
26 28 21 24 26 23 25 21 25 26 23
24 28 26 28 28 25 26 25 21 24 29
28 23
To classify these data we use tally marks. For each entry in the data we put a vertical line as a
tally mark against its value, to find how many 21's there are, how many 22's, and so on. Four marks
are drawn vertical and the fifth is drawn slanting across them, which makes a bundle of five.
Number of employees    Tally marks    Frequency
21                     ||||/          5
22                     |              1
23                     ||||           4
24                     ||||           4
25                     ||||/ ||       7
26                     ||||/ ||       7
27                     -              0
28                     ||||/ |        6
29                     |              1
Total                                 35
The frequency of each value is obtained by counting the tally marks against it.
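The tallying procedure can be sketched in a few lines of Python. This is only an illustration, not part of the original unit: the choice of Python, the helper name `tally` and the text rendering of a bundle of five as `||||/` are assumptions made here.

```python
from collections import Counter

# Number of employees in the 35 small scale industries listed above.
employees = [
    21, 22, 26, 24, 28, 21, 23, 25, 26, 25, 25,
    26, 28, 21, 24, 26, 23, 25, 21, 25, 26, 23,
    24, 28, 26, 28, 28, 25, 26, 25, 21, 24, 29,
    28, 23,
]

def tally(n):
    """Render n as tally marks: bundles of five ('||||/') plus leftover lines."""
    bundles = ["||||/"] * (n // 5)
    leftover = ["|" * (n % 5)] if n % 5 else []
    return " ".join(bundles + leftover)

counts = Counter(employees)

# Cover every value from the minimum to the maximum so that a value with no
# observations (27 here) still appears with frequency 0.
frequency_table = {
    value: counts[value] for value in range(min(employees), max(employees) + 1)
}

for value, frequency in frequency_table.items():
    print(f"{value:>3}  {tally(frequency):<9}  {frequency}")
print("Total", sum(frequency_table.values()))
```

The frequencies produced this way should agree with the table above, including the total of 35.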
For continuous data we make non-overlapping sub-classes called class intervals. The
minimum and the maximum observations determine the range covered by the class intervals.
The number of classes should be neither too large nor too small. As a guide, the number of
classes should be between 8 and 12, but there is no hard and fast rule.
The classes indicate the parts of the whole range over which the observations are
scattered. A class is identified by its limits, which are called class limits or class
boundaries. The class limits restrict the numbers to the given range. The lower end of
the class is called the lower limit and the upper end of the class is called the upper limit.
Example: if the classes are 0-10; 10-20; 20-30; … 70-80, the lower limit of the class 20-30 is
20 and its upper limit is 30.
If the upper limit of a class is the same as the lower limit of the next class, the class
interval is called an exclusive class interval. In such a case an observation that is exactly
equal to the upper limit is included in the next class, where that number is the lower limit.
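The rule for exclusive class intervals can be illustrated with a short Python sketch. The function name `exclusive_class` is hypothetical, and the sketch assumes the classes 0-10, 10-20, …, 70-80 of the example; under this convention the value 80 itself falls outside the last class.

```python
from bisect import bisect_right

# Class limits 0, 10, 20, ..., 80 for the classes 0-10, 10-20, ..., 70-80.
limits = list(range(0, 90, 10))

def exclusive_class(x):
    """Return (lower limit, upper limit) of the exclusive class containing x."""
    if not limits[0] <= x < limits[-1]:
        raise ValueError(f"{x} lies outside the classes")
    i = bisect_right(limits, x) - 1   # index of the largest limit <= x
    return limits[i], limits[i + 1]

print(exclusive_class(19.9))  # falls in the class 10-20
print(exclusive_class(20))    # equal to an upper limit, so it goes to 20-30
```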
We may get classes where the upper limit of a class is not the same as the lower
limit of the succeeding class. Then we need to make a continuity correction, because for a
continuous variable observations are always possible between any two
numbers. Unequal upper and lower limits of successive classes indicate a
discontinuity that prevents the variable from taking values in that part of the
range.
Continuity correction is the process of making the class intervals continuous,
without breaks.
The continuity correction is applied as follows. The part of the range that is not
included in any class is distributed evenly between the adjacent classes by
extending their limits.
The correction factor is half of the difference between the lower limit of the second class
and the upper limit of the first class. It is subtracted from every lower limit and added to
every upper limit.
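As a small worked sketch of the correction, consider classes of the form 10-14, 15-19, 20-24, as used in some tables of this unit. The Python below is illustrative only; the variable names are assumptions of this example.

```python
# Classes with a gap of 1 between successive limits (10-14, 15-19, 20-24).
classes = [(10, 14), (15, 19), (20, 24)]

# Correction factor: half the difference between the lower limit of the
# second class and the upper limit of the first class.
correction = (classes[1][0] - classes[0][1]) / 2

# Extend each class by the correction factor on both sides.
boundaries = [(lower - correction, upper + correction) for lower, upper in classes]

print(correction)   # 0.5
print(boundaries)   # [(9.5, 14.5), (14.5, 19.5), (19.5, 24.5)]
```

After the correction the upper limit of each class coincides with the lower limit of the next, so the classes are continuous.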
Activity:
Class interval……… …….. ……….. ………. ……….
The difference between the upper and the lower limit is called the class width. The
class width indicates the span or size of the class. The class width should be neither very
small nor very large. If the class is 20-30, the class width is 30 - 20 = 10. Generally the
classes are formed such that the width is 5 or 10 or a multiple of 5.
The midpoint of the class is called the class mark. It is calculated as the sum of the upper
and lower limits divided by two; for the class 20-30 the class mark is (20 + 30)/2 = 25.
With continuous data we make use of the class mark very often.
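The two definitions can be checked with a minimal Python sketch; the function names `class_width` and `class_mark` are chosen here for illustration and do not come from the original text.

```python
def class_width(lower, upper):
    """Class width: upper limit minus lower limit."""
    return upper - lower

def class_mark(lower, upper):
    """Class mark: midpoint of the class."""
    return (lower + upper) / 2

for lower, upper in [(0, 10), (10, 20), (20, 30)]:
    print(f"{lower}-{upper}: width {class_width(lower, upper)}, "
          f"class mark {class_mark(lower, upper)}")
```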
Activity:
23 32 25 26 28 29 35 36 34 38 36
25 54 52 53 56 58 59 50 45 44 43
52 32 35 39 48 39 31 35 60 25 45
45 56 52 23 56 31 32 35 39 48 52
32 35 59 42 25 28
2456 2555 2600 2700 2850 3010 2566 2940 3650 2640 3120
2479 2680 2150 3200 4500 2790 2460 2510 2340 2860 2700
2170 2190 2350 2640 2890 2645 3500 2150 2350 2650
2450 2489 2310 2650 2480 2660 2330
Class interval Tally marks Frequency
7.4 Relative frequency distribution
The relative frequency gives the fraction of the total number of observations contained in a
class. The proportions of the observations lying in the non-overlapping class intervals are
shown in the relative frequency distribution. For data of n observations the relative
frequency is calculated as
Relative frequency of a class = (frequency of the class) / (total number of observations)
The relative frequency distribution is the tabular summary of the relative frequencies for all
the classes.
Illustration
According to Beverage Digest, Coke classic, Diet coke, Dr. Pepper, Pepsi cola and
Sprite are the five top-selling soft drinks. The table below shows the drinks selected in a
sample of 50 soft drink purchases.
Frequency distribution of soft drink purchases
Soft drink      Frequency
Coke classic    19
Diet coke       8
Dr. Pepper      5
Pepsi cola      13
Sprite          5
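Relative frequency distribution of soft drink purchases
Soft drink      Frequency    Relative frequency
Coke classic    19           0.38
Diet coke       8            0.16
Dr. Pepper      5            0.10
Pepsi cola      13           0.26
Sprite          5            0.10
Total           50           1.00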
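The relative frequencies of the illustration can be reproduced with a short Python sketch. The dictionary name `purchases` is an assumption of this example; the frequencies are those of the soft drink table.

```python
# Frequencies of the 50 soft drink purchases in the illustration.
purchases = {
    "Coke classic": 19,
    "Diet coke": 8,
    "Dr. Pepper": 5,
    "Pepsi cola": 13,
    "Sprite": 5,
}

total = sum(purchases.values())

# Relative frequency = frequency of the class / total number of observations.
relative = {drink: frequency / total for drink, frequency in purchases.items()}

for drink, value in relative.items():
    print(f"{drink:<13} {purchases[drink]:>3}  {value:.2f}")
print(f"{'Total':<13} {total:>3}  {sum(relative.values()):.2f}")
```

The relative frequencies always add up to 1, which is a quick check on the arithmetic.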
Exercise
The data give the time in days required to complete year-end audits for a sample of 20
clients of an accounting firm. Classify the data and find the relative frequency distribution.
Total 20 1.00
Exercise:
The staff of a doctor's office has studied the waiting times for patients who arrive at the
office with a request for emergency service. The following data were collected over a one-month
period. Waiting times are in minutes.
2 5 10 12 4 4 5 7 11 8 9
8 12 21 6 8 7 13 18 3
Use classes of 0-4, 5-9 etc. Show the frequency distribution.
Show the relative frequency distribution.
What proportion of patients needing emergency service have a waiting time of 10-14 minutes?
7.5 Cumulative frequency distribution
In the less than or equal to type of cumulative frequency, we count the observations that are
less than or equal to the upper limit of each class interval. How many observations are less
than or equal to 10? They are 5. How many observations are less than or equal to 20? They are
5 + 12 = 17, and so on. How many are less than or equal to 70? They are all 100.
In the greater than or equal to type of cumulative frequency, we count the observations that
are greater than or equal to the lower limit of each class interval. Consider the last class
interval, 60-70. How many observations are greater than or equal to 60? They are 3. How many
are greater than or equal to 50? They are 3 + 17 = 20. In the same way, all 100 observations
are greater than or equal to the lower limit of the first class.
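Both types of cumulative frequency are running totals, which the following Python sketch illustrates. Only the frequencies 5, 12, 17 and 3 and the total of 100 come from the text above; the three middle frequencies and the lower limit 0 of the first class are invented here so that the example adds up.

```python
from itertools import accumulate

lower_limits = [0, 10, 20, 30, 40, 50, 60]   # 0 is an assumed first lower limit
upper_limits = [10, 20, 30, 40, 50, 60, 70]
# Only 5, 12, 17, 3 and the total 100 come from the text; 20, 25 and 18 are
# hypothetical values chosen for illustration.
frequencies = [5, 12, 20, 25, 18, 17, 3]

# Less than or equal to type: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# Greater than or equal to type: running total from the last class upwards.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

for lo, up, f, lt, gt in zip(lower_limits, upper_limits, frequencies,
                             less_than, greater_than):
    print(f"{lo:>2}-{up:<2}  f={f:>2}  <= {up}: {lt:>3}  >= {lo}: {gt:>3}")
```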
Exercise
10-14 4
15-19 8
20-24 5
25-29 2
30-34 1
Total 20
7.6 Stem and leaf diagram
2 3 4 5
For three-digit numbers, the stem and leaf diagram can be drawn using two digits as the stem
and one digit as the leaf, or using one digit as the stem and two digits as the leaf.
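The two ways of splitting three-digit numbers can be sketched in Python as follows. The function `stem_and_leaf` and the seven observations are hypothetical, invented purely for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_digits=1):
    """Group values by stem; the last `leaf_digits` digits form the leaf."""
    base = 10 ** leaf_digits
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, base)
        rows[stem].append(f"{leaf:0{leaf_digits}d}")
    return dict(rows)

# Hypothetical three-digit observations, invented for illustration.
data = [112, 125, 118, 131, 127, 112, 134]

# Two digits as the stem and one digit as the leaf.
for stem, leaves in stem_and_leaf(data, leaf_digits=1).items():
    print(f"{stem:>2} | {' '.join(leaves)}")

# One digit as the stem and two digits as the leaf.
for stem, leaves in stem_and_leaf(data, leaf_digits=2).items():
    print(f"{stem:>2} | {' '.join(leaves)}")
```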
Exercise
7.7 Diagrammatic presentation of data
Data are presented using various diagrams for a better understanding of the patterns in the
data. Different graphs and diagrams are used depending upon the purpose and the
need.
The most common are histograms, line diagrams, bar diagrams, frequency polygons,
cumulative frequency curves (ogives) and pie diagrams.
Line diagrams: A line diagram presents two-variable data. One variable is
plotted on the X axis and the second variable is plotted on the Y axis. The points
are joined by lines. A line diagram shows the increase or decrease in the data.
Line diagrams are used for time series data, where the year is plotted on the X axis
and the value of the other variable is plotted on the Y axis; the diagram shows the
general tendency of the data as a whole.
The following data give the wholesale price index for a certain period.
[Figure: line diagram of the wholesale price index (Series1); X axis: years 1994-95 to 2000-2001, Y axis: 0 to 14]
Bar diagram: The data are presented using rectangular blocks, horizontal or vertical.
Bar diagrams are used for presenting the facts of the data. Articles in newspapers,
magazines, journals etc. use bar diagrams to present the behaviour of a
characteristic or attribute over time or space.
A vertical bar diagram presents the characteristic or attribute on the X axis and the
corresponding values on the Y axis.
A horizontal bar diagram uses the axes in the reverse order.
We illustrate how to draw bar diagrams using the following example.
Example: The data below show the production of shirts in a manufacturing company.
Year 1990 1991 1992 1993 1994 1995 1996
No. of shirts (‘00) 52 55 56 60 57 58 56
[Figure: vertical bar diagram of shirt production; X axis: years 1990 to 1996, Y axis: 0 to 70]
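As a rough illustration, the shirt production data can also be drawn as a horizontal bar diagram in plain text. The Python sketch below is not part of the original unit; the helper `bar` and the choice of one '#' per unit of the table (one hundred shirts) are assumptions of this example.

```python
# Shirt production data from the example above (in hundreds of shirts).
years = [1990, 1991, 1992, 1993, 1994, 1995, 1996]
shirts = [52, 55, 56, 60, 57, 58, 56]

def bar(value):
    """One '#' per unit, so the bar length is proportional to the value."""
    return "#" * value

# Horizontal bar diagram: the year labels each bar, the quantity runs along it.
for year, value in zip(years, shirts):
    print(f"{year} | {bar(value)} {value}")
```

Because every bar uses the same scale, the lengths can be compared directly; the longest bar belongs to 1993.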
Histograms are very commonly used to compare observations.
The height of each bar in a histogram is directly proportional to the quantity it represents.
Bar diagrams can be multiple bar diagrams as well as divided bar diagrams.
In a multiple bar diagram two or more sets of interrelated data are represented. The
method remains the same as that of the simple bar diagram. Sometimes the bars are shaded or
given different colours, as they show different items.
Divided bar diagram: The parts of a total are represented as segments of one
large bar. The magnitude of each segment in a bar is directly proportional to the
quantity it represents.
Illustration
The expenditure on different commodities is given for two families.
The incomes of the two families are 1000 and 1200 respectively.
The expenditures are as follows.
Commodity             Expenditure
                      Family A    Family B
Food                  300         400
Clothing              250         200
Education             50          360
Others                380         300
Savings or deficit    +20         -60
[Figure: percentage divided bar diagram of the expenditure of the two families, bars labelled 1 and 2; Y axis: -20% to 100%]
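The segment sizes of a percentage divided bar diagram for the two families can be worked out as in the Python sketch below. It assumes that each segment is expressed as a percentage of the family's income; the function name `segments` is hypothetical.

```python
# Incomes and expenditures of the two families in the illustration.
income = {"Family A": 1000, "Family B": 1200}
spending = {
    "Family A": {"Food": 300, "Clothing": 250, "Education": 50, "Others": 380},
    "Family B": {"Food": 400, "Clothing": 200, "Education": 360, "Others": 300},
}

def segments(family):
    """Percentage of income for each item; savings (+) or deficit (-) comes last."""
    items = dict(spending[family])
    items["Savings or deficit"] = income[family] - sum(spending[family].values())
    return {name: 100 * amount / income[family] for name, amount in items.items()}

for family in income:
    print(family)
    for name, percent in segments(family).items():
        print(f"  {name:<20} {percent:6.2f}%")
```

A deficit appears as a negative segment, which is why such a diagram may extend below 0%.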
Exercise
The following data give the estimated oilseed crop production for a season.
Oilseed       Production in lakh tonnes
              Area A    Area B
Groundnut     2.00      1.70
Soya bean     1.25      1.25
Sesame        0.75      0.20
Total         4.00      3.15
Pie diagram:
The circle of 360 degrees is divided into sectors according to the share of each component.
The percentage share of each component is converted into the corresponding angle in degrees
(1% corresponds to 3.6 degrees), and each sector is shaded or shown in a different colour.
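The conversion from shares to sector angles can be sketched in Python. The example reuses the soft drink frequencies from the illustration in the relative frequency section; the variable names are assumptions of this sketch.

```python
# Soft drink purchase frequencies from the earlier illustration (50 purchases).
shares = {
    "Coke classic": 19,
    "Diet coke": 8,
    "Dr. Pepper": 5,
    "Pepsi cola": 13,
    "Sprite": 5,
}

total = sum(shares.values())

# Sector angle = (share of the component / total) x 360 degrees.
angles = {name: 360 * value / total for name, value in shares.items()}

for name, angle in angles.items():
    print(f"{name:<13} {100 * shares[name] / total:5.1f}%  {angle:6.1f} degrees")
```

The angles of all the sectors should add up to 360 degrees.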
Illustration:
The Management class in an institute is made up of students with graduation
degrees as follows. Represent these data as a pie diagram.
[Figure: pie diagram with five sectors, labelled 1 to 5]
Exercise
Plot the pie chart for the data below:
Crop     Area in million hectares
Wheat    16.10
Rice     18.23
Jawar    3.50
Bajra    3.64
Maize    1.60
Total    43.07
Frequency curve:
The frequency curve is a smooth curve representing the frequency distribution. Plot the
class marks on the X axis and the frequencies on the Y axis.
[Figure: curve with upper class limits on the X axis and cumulative frequency on the Y axis]
Exercise
Ogives:
The cumulative frequency curves are called ogives. Smooth curves can be
drawn for the less than or equal to type or for the greater than or equal to type. The ogives
show how many observations lie below or above certain values in the distribution,
rather than recording the numbers within each interval.
The general form of the ogives is as follows: on the X axis we plot the class limits and along
the Y axis we plot the cumulative frequencies.
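The points of a less than type ogive, and the way it is read, can be sketched in Python. The example uses the age distribution of policy holders given in the review exercise of this unit. Joining the points by straight lines (linear interpolation) is a simplifying assumption made here, whereas the text describes a smooth curve; the function name `count_below` is hypothetical.

```python
from itertools import accumulate

# Age distribution of policy holders from the review exercise (classes 20-25, ..., 45-50).
limits = [20, 25, 30, 35, 40, 45, 50]
frequencies = [8, 12, 24, 16, 15, 5]

# Less than type ogive: cumulative frequency against each upper class limit,
# starting from 0 at the lower limit of the first class.
ogive = list(zip(limits, [0, *accumulate(frequencies)]))

def count_below(x):
    """Read the ogive at x, joining neighbouring points by straight lines."""
    for (x0, c0), (x1, c1) in zip(ogive, ogive[1:]):
        if x0 <= x <= x1:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the class limits")

print(ogive)
print(count_below(32.5))   # estimated number of policy holders aged below 32.5
```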
[Figure: general form of an ogive; X axis: class limits]
Illustration:
[Figure: less than or equal to ogive; X axis: upper class limits, Y axis: less than or equal to cumulative frequency, 0 to 70]
Greater than frequency curve
150 154 158 162 166
[Figure: greater than or equal to ogive; X axis: lower class limits, Y axis: greater than or equal to cumulative frequency, 0 to 70]
Exercise:
The following data relate to factory size according to employment. Draw a less than curve
and a more than curve for these data.
The frequency distribution of the weekly wages of 100 workers in a factory is given below:
Weekly wages    No. of workers
120-124         3
125-129         5
130-134         12
135-139         23
140-144         31
145-149         10
150-154         8
155-159         5
160-164         3
Draw the ogive for the distribution and use it to determine the median wage of a worker
and verify the result by the formula.
7.8 Key words
Tally marks: Tally marks are the small vertical lines used as symbols to represent a
count.
Class limits or class boundaries: A class is identified by its limits, which are called class
limits or class boundaries.
Relative frequency: The relative frequency gives the fraction of the total number of
observations contained in a class.
Cumulative frequency: Cumulative frequency shows the number of data items with
values less than or equal to the upper class limit of each class, or the number of data
items with values greater than or equal to the lower class limit of each class.
7.9 Suggested readings
Anderson et al., Statistics for Business and Economics, eighth edition, 2002, Thomson Asia
Pvt. Ltd., Singapore.
Frank and Althoen, Statistics: Concepts and Applications, 1994, Cambridge University Press,
Cambridge.
W. J. Stevenson, Business Statistics: Concepts and Applications, 1978, Harper and Row
Publishers, New York, USA.
7.10 Review exercise
1. The following data give the income distribution of workers in two factories.
Construct the relative frequency distributions and the cumulative frequency
distributions.
2. The following data give the income distribution of workers in two factories.
Which distribution shows more variability?
Income in 1000 Rs.    10-12  12-14  14-16  16-18  18-20  20-22  22-24
Factory 1             10     15     65     73     70     17     10
Factory 2             25     34     40     50     30     30     10
6. For the following data relating to the age of policy holders, draw the histogram.
Age in years             20-25  25-30  30-35  35-40  40-45  45-50
No. of policy holders    8      12     24     16     15     5
8. The table below shows the annual sales ($ millions) of Speedcall mobile
phones for a random sample of 150 outlets.
Annual sales of Speedcall mobile phones ($ million)    Number of outlets
5-9                                                    18
10-14                                                  35
15-19                                                  41
20-24                                                  21
25-29                                                  15
30-34                                                  13
35-39                                                  7