20BCE1779 - Web Mining - Lab-1

Name :- Patel Vedant Vikrambhai
Reg. No. :- 20BCE1779
Sub :- Web Mining
Code :- CSE3024
Lab :- 1
1) Unix GREP command
Explore UNIX GREP Command

The grep filter searches a file for a particular pattern of characters and displays all lines that contain
that pattern. The pattern that is searched for in the file is referred to as a regular expression (grep stands
for "globally search for a regular expression and print").
ALGORITHM
1. Open the UNIX/LINUX system.
2. Use the grep command to search for text in files.
3. Verify the result of the output.
OUTPUT :-
grep -i : Ignores case when matching.
grep -v : Prints all the lines that do not match the pattern.
grep -o : Prints only the matched parts of a matching line, with each such part on a separate output line.
grep -n : Displays the matched lines together with their line numbers.
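The behaviour of the four options described above (-i, -v, -o and -n in grep) can be illustrated with a small Python sketch. This is only an emulation built on Python's re module, not grep itself, and Python's regular expression syntax differs from grep's in places. The function name grep, its keyword arguments and the sample lines are hypothetical and are not taken from the lab.

```python
import re

def grep(pattern, lines, ignore_case=False, invert=False,
         only_matching=False, line_numbers=False):
    """Return the lines that a grep-like search would print."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    output = []
    for number, line in enumerate(lines, start=1):
        matched = regex.search(line) is not None
        # With invert=True (like -v) keep only the lines that do NOT match.
        if matched == invert:
            continue
        if only_matching and not invert:
            # Like -o: one output entry per non-empty matched part.
            pieces = [m.group(0) for m in regex.finditer(line) if m.group(0)]
        else:
            pieces = [line]
        for piece in pieces:
            # Like -n: prefix each entry with its line number.
            output.append(f"{number}:{piece}" if line_numbers else piece)
    return output

# Hypothetical sample text, used only for illustration.
sample = ["Unix is great", "I like unix", "Linux is free"]

print(grep("unix", sample, ignore_case=True))     # like grep -i
print(grep("unix", sample, invert=True))          # like grep -v
print(grep("i[sk]", sample, only_matching=True))  # like grep -o
print(grep("unix", sample, line_numbers=True))    # like grep -n
```

Each call returns a list of output lines, so the effect of one option can be compared with another on the same input.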
2) Boolean retrieval model

Write a program to create the inverted index and execute for the following document
collections.
a)
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
b)
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
c) Generate term-document incidence matrix for a) and b).
d) For the document collection shown in (b), compute the results for these queries:
i) schizophrenia AND drug
ii) for AND NOT(drug OR approach)
Explanation 2 (a) and (b)
ASSUMPTIONS
1. The initial assumption is that the documents are given in the form of a list, where each
element is a document containing all of its text.
2. This means that an element can instead be a file name if the amount of text is very large.
3. In this case, as the amount of text is small, each document is stored as a string and hence the list contains
the variable names of these document strings.
DATA STRUCTURES
List
HashMap
Set
ALGORITHM
1. The function receives the list containing the documents.
2. It then iterates over this document list, one document at a time.
3. The words contained in each document are extracted into a list. This is done using a simple
split() call. Ideally one should use a tokenizer such as the one in the NLTK library, since
split() only breaks the text on whitespace and does not handle punctuation or case.
4. Each word in the document is paired with the document number to create a tuple,
which is appended to words_list[].
5. This creates a global list of tuples of the form (term, document_id).
6. This list is then converted to a set to remove duplicate tuples.
7. The set is then sorted alphabetically into a new list.
8. After this a dictionary (hash_map) is created, in which the term serves as the key
and the value is a tuple containing the list of documents and the frequency.
9. The list grows dynamically; a linked list can be used instead if extra information
needs to be stored with each entry.
10. The text suggests storing the hash map values on the hard disk, as they tend to form a
very large file.
11. In this use case the lists are small enough to be kept in memory.
12. The dictionary is built by iterating through all the tuples in the word list.
13. Each time a term is encountered, the associated document id is appended to the list
for that term and the frequency value is incremented.
14. Finally, the postings list of each dictionary term is sorted.
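The steps above can be sketched in Python as follows. This is a minimal illustration under the stated assumptions (documents held as strings in a list, whitespace tokenisation with split()), not the code originally submitted with the lab. The names build_inverted_index, docs_a and index_a are hypothetical. Because duplicate (term, document_id) tuples are removed in step 6, the frequency stored here is the number of documents that contain the term.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to (postings_list, frequency), following steps 1-14."""
    # Steps 2-5: collect (term, document_id) tuples; ids start at 1.
    words_list = []
    for doc_id, text in enumerate(documents, start=1):
        for term in text.split():
            words_list.append((term, doc_id))

    # Steps 6-7: drop duplicate tuples, then sort by term and document id.
    sorted_pairs = sorted(set(words_list))

    # Steps 8, 12, 13: append each document id to the list for its term.
    postings = defaultdict(list)
    for term, doc_id in sorted_pairs:
        postings[term].append(doc_id)

    # Step 14: the postings are already in order because the pairs were
    # sorted; store them together with their frequency.
    return {term: (ids, len(ids)) for term, ids in postings.items()}

# Document collection (a) from the problem statement.
docs_a = [
    "new home sales top forecasts",
    "home sales rise in july",
    "increase in home sales in july",
    "july new home sales rise",
]

index_a = build_inverted_index(docs_a)
for term, (doc_ids, freq) in index_a.items():
    print(f"{term:10s} freq={freq} -> {doc_ids}")
```

Passing the four documents of collection (b) to the same function builds the index for that collection.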
CODE AND OUTPUTS :-

3) Inverted index :-
ALGORITHM
1. The query is taken as input.
2. The meaning of the query is deciphered manually.
3. In future the query text could be converted into its operations automatically; for
example, AND can be understood as the bitwise ‘&’. This would let us
encapsulate this part as a modular function.
4. Using the term-document incidence matrix, the list of values
associated with each query term is passed to a function.
5. In the function, the list is iterated over to convert it into a suitable number.
6. The bitwise operation is performed on the returned values based on the query.
7. After this the result is converted to a binary string, with leading 0
padding in case some of the initial positions become 0.
8. The string is then passed to a function, where each string index
having the value 1 indicates that the associated document
satisfies the query.
9. This prints out all the relevant documents fulfilling the condition.
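The procedure above can be sketched in Python for collection (b) and the two queries from part (d). This is an illustrative sketch rather than the submitted program. The helper names incidence_matrix, to_number and matching_docs are hypothetical, and the queries are translated into bitwise operations by hand, as in step 2. A mask is applied after the bitwise NOT so that the result stays within one bit per document.

```python
def incidence_matrix(documents):
    """Map each term to a 0/1 row with one entry per document."""
    vocabulary = sorted({term for doc in documents for term in doc.split()})
    return {
        term: [1 if term in doc.split() else 0 for doc in documents]
        for term in vocabulary
    }

def to_number(bits):
    """Step 5: turn a 0/1 row into an integer (first document = highest bit)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def matching_docs(value, n_docs):
    """Steps 7-8: pad to n_docs binary digits and list the positions set to 1."""
    bit_string = format(value, f"0{n_docs}b")
    return [i for i, bit in enumerate(bit_string, start=1) if bit == "1"]

# Document collection (b) from the problem statement.
docs_b = [
    "breakthrough drug for schizophrenia",
    "new schizophrenia drug",
    "new approach for treatment of schizophrenia",
    "new hopes for schizophrenia patients",
]

matrix = incidence_matrix(docs_b)
n = len(docs_b)
mask = (1 << n) - 1  # keeps the result of NOT within n bits
vec = {term: to_number(row) for term, row in matrix.items()}

# i) schizophrenia AND drug
q1 = vec["schizophrenia"] & vec["drug"]
# ii) for AND NOT(drug OR approach)
q2 = vec["for"] & (~(vec["drug"] | vec["approach"]) & mask)

print("i) ", ["Doc %d" % d for d in matching_docs(q1, n)])   # Doc 1, Doc 2
print("ii)", ["Doc %d" % d for d in matching_docs(q2, n)])   # Doc 4
```

Worked by hand, the rows are schizophrenia = 1111, drug = 1100, for = 1011 and approach = 0010, so query (i) gives 1100 and query (ii) gives 1011 AND NOT(1110) = 0001.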
CODE AND OUTPUTS :-
