Data Structure-Unit-V
INTERNAL – SORTING
Internal Sorting
An internal sort is any data sorting process that takes place entirely within the main memory of
a computer. This is possible whenever the data to be sorted is small enough to be held entirely in
main memory. This contrasts with external sorting, which must use slower media such as a hard
disk. Any reading or writing of data to and from this slower media can slow the sorting process
considerably, and this issue has implications for different sort algorithms.
Some common internal sorting algorithms include:
1. Bubble Sort
2. Insertion Sort
3. Quick Sort
4. Heap Sort
5. Radix Sort
6. Selection sort
Consider a Bubble sort, where adjacent records are swapped in order to get them into the right
order, so that records appear to “bubble” up and down through the dataspace. If this has to be
done in chunks, then when we have sorted all the records in chunk 1, we move on to chunk 2, but
we find that some of the records in chunk 1 need to “bubble through” chunk 2, and vice versa
(i.e., there are records in chunk 2 that belong in chunk 1, and records in chunk 1 that belong in
chunk 2 or later chunks). This will cause the chunks to be read and written back to disk many
times as records cross over the boundaries between them, resulting in a considerable degradation
of performance. If the data can all be held in memory as one large chunk, then this performance
hit is avoided.
On the other hand, some algorithms handle external sorting much better. A merge sort breaks
the data up into chunks, sorts the chunks by some other algorithm (perhaps bubble sort or quick
sort), and then recombines the chunks two by two so that each recombined chunk is in order. This
approach minimizes the number of reads and writes of data chunks to and from disk, and is a
popular external sort method.
Insertion Sort
This is an in-place comparison-based sorting algorithm. Here, a sub-list is maintained which is
always sorted. For example, the lower part of an array is maintained to be sorted. An element
which is to be inserted into this sorted sub-list has to find its appropriate place and then be
inserted there. Hence the name, insertion sort.
The array is searched sequentially, and unsorted items are moved and inserted into the sorted sub-
list (in the same array). This algorithm is not suitable for large data sets, as its average and worst
case complexity are O(n²), where n is the number of items.
Consider an unsorted array whose first four elements are 14, 33, 27 and 10. Insertion sort
compares the first two elements and finds that 14 and 33 are already in ascending order. For now,
14 is the sorted sub-list.
Next, it swaps 33 with 27, checking 27 against all the elements of the sorted sub-list. Here the
sorted sub-list has only one element, 14, and 27 is greater than 14. Hence, the sorted sub-list
remains sorted after the swap.
By now we have 14 and 27 in the sorted sub-list. Next, it compares 33 with 10 and swaps them,
repeating until 10 reaches the front. By the end of the third iteration, we have a sorted sub-list
of 4 items.
This process goes on until all the unsorted values are covered in a sorted sub-list. Now we shall
see some programming aspects of insertion sort.
Quick sort
Sorting is a way of arranging items in a systematic manner. Quicksort is a widely used sorting
algorithm that makes O(n log n) comparisons in the average case when sorting an array of n
elements. It is a fast and highly efficient sorting algorithm that follows the divide and conquer
approach. Divide and conquer is a technique of breaking down an algorithm into subproblems,
solving the subproblems, and combining the results back together to solve the original problem.
Divide: In Divide, first pick a pivot element. After that, partition or rearrange the array into two
sub-arrays such that each element in the left sub-array is less than or equal to the pivot element
and each element in the right sub-array is larger than the pivot element.
Quicksort picks an element as pivot, and then it partitions the given array around the picked
pivot element. In quick sort, a large array is divided into two arrays in which one holds values
that are smaller than the specified value (Pivot), and another array holds the values that are
greater than the pivot.
After that, the left and right sub-arrays are partitioned using the same approach. This continues
until only a single element remains in each sub-array.
Picking a good pivot is necessary for a fast implementation of quicksort. However, it is hard to
determine a good pivot in advance. Some of the ways of choosing a pivot are as follows -
The pivot can be random, i.e. select a random element of the given array as the pivot.
The pivot can be either the rightmost element or the leftmost element of the given array.
Select the median as the pivot element.
Working of Quick Sort Algorithm
To understand the working of quick sort, let's take an unsorted array, say {24, 9, 29, 14, 19, 27}
- the values used in the steps below. It will make the concept clearer and more understandable.
In the given array, we consider the leftmost element as the pivot. So, in this case, a[left] =
24, a[right] = 27 and a[pivot] = 24.
Since the pivot is at the left, the algorithm starts from the right and moves towards the left.
Now, a[pivot] < a[right], so the algorithm moves one position towards the left, i.e. -
Because a[pivot] > a[right], the algorithm swaps a[pivot] with a[right], and the pivot moves to
the right, as -
Now, a[left] = 19, a[right] = 24, and a[pivot] = 24. Since the pivot is at the right, the algorithm
starts from the left and moves to the right.
As a[pivot] > a[left], the algorithm moves one position to the right, as -
Now, a[left] = 9, a[right] = 24, and a[pivot] = 24. As a[pivot] > a[left], the algorithm moves one
position to the right, as -
Now, a[left] = 29, a[right] = 24, and a[pivot] = 24. As a[pivot] < a[left], swap a[pivot] and
a[left]; now the pivot is at the left, i.e. -
Since the pivot is at the left, the algorithm starts from the right and moves to the left. Now,
a[left] = 24, a[right] = 29, and a[pivot] = 24. As a[pivot] < a[right], the algorithm moves one
position to the left, as -
Now, a[pivot] = 24, a[left] = 24, and a[right] = 14. As a[pivot] > a[right], swap a[pivot] and
a[right]; now the pivot is at the right, i.e. -
Now, a[pivot] = 24, a[left] = 14, and a[right] = 24. The pivot is at the right, so the algorithm
starts from the left and moves to the right.
Now, a[pivot] = 24, a[left] = 24, and a[right] = 24. So pivot, left and right all point to the same
element, which marks the termination of the procedure.
Element 24, which is the pivot element, is now placed at its exact position.
Elements to the right of 24 are greater than it, and elements to the left of 24 are smaller
than it.
Now, in a similar manner, the quick sort algorithm is separately applied to the left and right sub-
arrays. After sorting is done, the array will be -
Implementation of quicksort
#include <stdio.h>
/* function that considers the last element as pivot,
places the pivot at its exact position, and places
smaller elements to the left of the pivot and greater
elements to the right of the pivot. */
int partition (int a[], int start, int end)
{
    int pivot = a[end]; // pivot element
    int i = (start - 1);
    for (int j = start; j < end; j++)
    {
        // if the current element is smaller than the pivot, move it to the left part
        if (a[j] < pivot)
        {
            i++;
            int t = a[i]; a[i] = a[j]; a[j] = t; // swap a[i] and a[j]
        }
    }
    int t = a[i + 1]; a[i + 1] = a[end]; a[end] = t; // place the pivot
    return (i + 1); // the pivot's final position
}
Heap Sort
Heap sort processes the elements by creating a min-heap or max-heap from the elements of the
given array. A min-heap or max-heap represents an ordering of the array in which the root
element is the minimum or maximum element of the array. Heap sort basically performs two
main operations recursively - building a heap from the array, and repeatedly deleting the root
element of that heap.
What is a heap?
A heap is a complete binary tree; a binary tree is a tree in which each node can have at most
two children. A complete binary tree is a binary tree in which all levels except the last are
completely filled, and the nodes of the last level are left-justified.
Heap sort is a popular and efficient sorting algorithm. The concept of heap sort is to eliminate the
elements one by one from the heap part of the list, and then insert them into the sorted part of the
list. Heap sort is an in-place sorting algorithm.
In heap sort, there are basically two phases involved in sorting the elements. They are as
follows -
o The first step is the creation of a heap by adjusting the elements of the array.
o After the creation of the heap, repeatedly remove the root element of the heap by
shifting it to the end of the array, and then restore the heap structure for the remaining
elements.
Now let's see the working of heap sort in detail by using an example. To understand it more
clearly, let's take an unsorted array and try to sort it using heap sort. It will make the explanation
clearer and easier.
First, we have to construct a heap from the given array and convert it into max heap.
After converting the given heap into max heap, the array elements are -
Next, we have to delete the root element (89) from the max heap. To delete this node, we have to
swap it with the last node, i.e. (11). After deleting the root element, we again have to heapify it to
convert it into max heap.
After swapping the array element 89 with 11, and converting the heap into max-heap, the
elements of array are -
In the next step, again, we have to delete the root element (81) from the max heap. To delete this
node, we have to swap it with the last node, i.e. (54). After deleting the root element, we again
have to heapify it to convert it into max heap.
After swapping the array element 81 with 54 and converting the heap into max-heap, the
elements of array are -
In the next step, we have to delete the root element (76) from the max heap again. To delete this
node, we have to swap it with the last node, i.e. (9). After deleting the root element, we again
have to heapify it to convert it into max heap.
After swapping the array element 76 with 9 and converting the heap into max-heap, the elements
of array are -
In the next step, again we have to delete the root element (54) from the max heap. To delete this
node, we have to swap it with the last node, i.e. (14). After deleting the root element, we again
have to heapify it to convert it into max heap.
After swapping the array element 54 with 14 and converting the heap into max-heap, the
elements of array are -
In the next step, again we have to delete the root element (22) from the max heap. To delete this
node, we have to swap it with the last node, i.e. (11). After deleting the root element, we again
have to heapify it to convert it into max heap.
After swapping the array element 22 with 11 and converting the heap into max-heap, the
elements of array are -
In the next step, again we have to delete the root element (14) from the max heap. To delete this
node, we have to swap it with the last node, i.e. (9). After deleting the root element, we again
have to heapify it to convert it into max heap.
After swapping the array element 14 with 9 and converting the heap into max-heap, the elements
of array are -
In the next step, again we have to delete the root element (11) from the max heap. To delete this
node, we have to swap it with the last node, i.e. (9). After deleting the root element, we again
have to heapify it to convert it into max heap.
After swapping the array element 11 with 9, the elements of array are -
Now, heap has only one element left. After deleting it, heap will be empty.
#include <stdio.h>
/* function to heapify a subtree. Here 'i' is the
index of root node in array a[], and 'n' is the size of heap. */
void heapify(int a[], int n, int i)
{
int largest = i; // Initialize largest as root
int left = 2 * i + 1; // left child
int right = 2 * i + 2; // right child
// If left child is larger than root
if (left < n && a[left] > a[largest])
largest = left;
// If right child is larger than current largest
if (right < n && a[right] > a[largest])
largest = right;
// If root is not largest
if (largest != i) {
// swap a[i] with a[largest]
int temp = a[i];
a[i] = a[largest];
a[largest] = temp;
heapify(a, n, largest);
}
}
/*Function to implement the heap sort*/
void heapSort(int a[], int n)
{
// Build max heap, starting from the last non-leaf node
for (int i = n / 2 - 1; i >= 0; i--)
heapify(a, n, i);
// One by one extract an element from heap
for (int i = n - 1; i >= 0; i--) {
/* Move current root element to end*/
// swap a[0] with a[i]
int temp = a[0];
a[0] = a[i];
a[i] = temp;
heapify(a, i, 0);
}
}
/* function to print the array elements */
void printArr(int arr[], int n)
{
for (int i = 0; i < n; ++i)
{
printf("%d", arr[i]);
printf(" ");
}
}
int main()
{
int a[] = {48, 10, 23, 43, 28, 26, 1};
int n = sizeof(a) / sizeof(a[0]);
printf("Before sorting array elements are - \n");
printArr(a, n);
heapSort(a, n);
printf("\nAfter sorting array elements are - \n");
printArr(a, n);
return 0;
}
Output
Before sorting array elements are -
48 10 23 43 28 26 1
After sorting array elements are -
1 10 23 26 28 43 48
Shell sort
Shell sort is a generalization of insertion sort that overcomes one of its drawbacks by comparing
elements separated by a gap of several positions. It is an extended version of insertion sort and
improves on its average time complexity. Like insertion sort, it is a comparison-based and
in-place sorting algorithm. Shell sort is efficient for medium-sized data sets. In insertion sort,
elements can be moved ahead by only one position at a time, so moving an element to a far-away
position requires many movements, which increases the algorithm's execution time. Shell sort
overcomes this drawback by allowing the movement and swapping of far-away elements. The
algorithm first sorts elements that are far apart from each other, then it
subsequently reduces the gap between them. This gap is called the interval. The interval can be
calculated by using Knuth's formula given below -
h = h * 3 + 1
where 'h' is the interval, with an initial value of 1.
In the first loop, n is equal to 8 (the size of the array, here {33, 31, 40, 8, 12, 17, 25, 42}), so
the elements lie at an interval of 4 (n/2 = 4). Elements will be compared and swapped if they are
not in order.
Here, in the first loop, the element at the 0th position will be compared with the element at
4th position. If the 0th element is greater, it will be swapped with the element at 4th position.
Otherwise, it remains the same. This process will continue for the remaining elements.
At the interval of 4, the sub lists are {33, 12}, {31, 17}, {40, 25}, {8, 42}.
Now, we have to compare the values in every sub-list. After comparing, we have to swap them if
required in the original array. After comparing and swapping, the updated array will look as
follows -
In the second loop, elements are lying at the interval of 2 (n/4 = 2), where n = 8.
Now, we are taking the interval of 2 to sort the rest of the array. With an interval of 2, two
sublists will be generated - {12, 25, 33, 40}, and {17, 8, 31, 42}.
Now, we again have to compare the values in every sub-list. After comparing, we have to swap
them if required in the original array. After comparing and swapping, the updated array will look
as follows -
In the third loop, elements are lying at the interval of 1 (n/8 = 1), where n = 8. At last, we use the
interval of value 1 to sort the rest of the array elements. In this step, shell sort uses insertion sort
to sort the array elements.
#include <stdio.h>
/* function to implement shell sort */
void shell(int a[], int n)
{
    /* Rearrange the array elements at n/2, n/4, ..., 1 intervals */
    for (int interval = n / 2; interval > 0; interval /= 2)
    {
        for (int i = interval; i < n; i += 1)
        {
            /* store a[i] in the variable temp and make the ith position empty */
            int temp = a[i];
            int j;
            for (j = i; j >= interval && a[j - interval] > temp; j -= interval)
                a[j] = a[j - interval];
            /* put temp (the original a[i]) in its correct position */
            a[j] = temp;
        }
    }
}
Another way, which trades memory efficiency for time efficiency in the case where the directory
doesn't change very often, is to have a separate array for each list order.
Those of you with experience using database programs may recognize this technique: each
entry in the database typically contains a number of fields, and the database program maintains
index arrays that allow the entries to be listed by one field or another.
Files
What is a File?
A file is a collection of records related to each other. The file size is limited by the size of
memory and the storage medium. Files are not data structures, but they can be containers that
hold data structures.
A file is a collection of records involving a set of entities with certain aspects in common,
organized for some particular purpose. There are three file organizations that capture the
relationship between fields, records and files - sequential, indexed sequential and relative. Data
files contain data and objects such as tables, indexes, stored procedures, and views. Log files
contain the information that is required to recover all transactions in the database. A file structure
is a combination of representations for data in files, and it enables applications to read, write, and
modify data.
Two important characteristics of a file are:
1. File Activity
2. File Volatility
File activity specifies the percentage of records actually processed in a single run.
File volatility addresses the frequency with which records change. A highly volatile file is
handled more efficiently on disk than on tape.
File Organization
File organization ensures that records are available for processing. It is used to determine an
efficient file organization for each base relation.
For example, if we want to retrieve employee records in alphabetical order of name, sorting the
file by employee name is a good file organization. However, if we want to retrieve all employees
whose marks are in a certain range, a file ordered by employee name would not be a good file
organization.
Direct access file organization helps in online transaction processing systems (OLTP) like an
online railway reservation system.
Advantages of direct access file organization
In a direct access file, sorting of the records is not required.
It accesses the desired records immediately.
It updates several files quickly.
It has better control over record allocation.
Advantages of indexed sequential access file organization
In an indexed sequential access file, both sequential and random access are possible.
It accesses the records very fast if the index table is properly organized.
Records can be inserted in the middle of the file.
It provides quick access for sequential and direct processing.
It reduces the degree of sequential search.
Disadvantages of indexed sequential access file organization
An indexed sequential access file requires unique keys and periodic reorganization.
An indexed sequential access file takes longer to search the index for data access or retrieval.
It requires more storage space.
It is expensive because it requires special software.
It is less efficient in the use of storage space as compared to other file organizations.
Index Techniques
o Indexing is used to optimize the performance of a database by minimizing the number of
disk accesses required when a query is processed.
o The index is a type of data structure. It is used to locate and access the data in a database
table quickly.
Index structure:
o The first column of the database is the search key that contains a copy of the primary key
or candidate key of the table. The values of the primary key are stored in sorted order so
that the corresponding data can be accessed easily.
o The second column of the database is the data reference. It contains a set of pointers
holding the address of the disk block where the value of the particular key can be found.
Indexing Methods
Ordered indices
The indices are usually sorted to make searching faster. The indices which are sorted are known
as ordered indices.
Example: Suppose we have an employee table with thousands of records, each of which is 10
bytes long. If the IDs start from 1, 2, 3, ... and so on, and we have to search for the employee
with ID 543 -
o In the case of a database with no index, we have to scan the disk blocks from the start
until we reach 543. The DBMS will find the record after reading 543*10 = 5430 bytes.
o In the case of an index, we search the index instead, and (assuming each index entry is
2 bytes) the DBMS will find the record after reading 542*2 = 1084 bytes, which is far
less than in the previous case.
Primary Index
o If the index is created on the basis of the primary key of the table, then it is known as
primary indexing. These primary keys are unique to each record, and there is a 1:1
relation between the index entries and the records.
o As primary keys are stored in sorted order, the performance of the searching operation is
quite efficient.
o The primary index can be classified into two types: Dense index and Sparse index.
Dense index
o The dense index contains an index record for every search key value in the data file. It
makes searching faster.
o In this, the number of records in the index table is the same as the number of records in
the main table.
o It needs more space to store index record itself. The index records have the search key
and a pointer to the actual record on the disk.
Sparse index
o In the data file, an index record appears only for a few items. Each item points to a block.
o In this, instead of pointing to each record in the main table, the index points to records
in the main table at intervals (in gaps).
Clustering Index
o A clustered index can be defined as an ordered data file. Sometimes the index is created
on non-primary key columns which may not be unique for each record.
o In this case, to identify the record faster, we will group two or more columns to get the
unique value and create index out of them. This method is called a clustering index.
o The records which have similar characteristics are grouped, and indexes are created for
these group.
Example: suppose a company contains several employees in each department. Suppose we use a
clustering index, where all employees which belong to the same Dept_ID are considered within a
single cluster, and index pointers point to the cluster as a whole. Here Dept_Id is a non-unique
key.
The previous scheme is a little confusing because one disk block may be shared by records that
belong to different clusters. Using a separate disk block for each cluster is considered a better
technique.
Secondary Index
In the sparse indexing, as the size of the table grows, the size of mapping also grows. These
mappings are usually kept in the primary memory so that address fetch should be faster. Then the
secondary memory searches the actual data based on the address got from mapping. If the
mapping size grows then fetching the address itself becomes slower. In this case, the sparse
index will not be efficient. To overcome this problem, secondary indexing is introduced.
In secondary indexing, to reduce the size of mapping, another level of indexing is introduced. In
this method, the huge range for the columns is selected initially so that the mapping size of the
first level becomes small. Then each range is further divided into smaller ranges. The mapping of
the first level is stored in the primary memory, so that address fetch is faster. The mapping of the
second level and actual data are stored in the secondary memory (hard disk).
For example:
o If you want to find the record with roll number 111 in the diagram, the search first looks
for the highest entry that is smaller than or equal to 111 in the first-level index. It gets
100 at this level.
o Then, in the second-level index, it again finds the largest entry less than or equal to 111
and gets 110. Using the address associated with 110, it goes to the data block and scans
each record until it finds 111.
o This is how a search is performed in this method. Inserting, updating or deleting is also
done in the same manner.