20BCE1779 - Web Mining - Lab-1
Reg. No. :- 20BCE1779
Sub :- Web Mining
Code :- CSE3024
Lab :- 1
1) Unix GREP command
OUTPUT :-
-i : Ignores case for matching.
-v : Prints all the lines that do not match the pattern.
-o : Prints only the matched parts of a matching line, with each such part on a separate output line.
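For reference, a minimal illustration of these three options (assuming GNU grep and a placeholder file named sample.txt with a placeholder pattern "error"):

grep -i "error" sample.txt    # case-insensitive: matches error, Error, ERROR, ...
grep -v "error" sample.txt    # inverted: prints only lines that do NOT contain the pattern
grep -o "error" sample.txt    # prints only the matched text, one match per output line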
a)
d) For the document collection shown in (b), compute the results for these queries:
ASSUMPTIONS
1. The initial assumption is that the documents are given in the form of a list, where each
element is a document containing all of its text.
2. If the amount of text were very large, each element could instead be a file name.
3. In this case, as the amount of text is small, each document is stored as a string and the list
holds these document strings.
DATA STRUCTURES
List
HashMap
Set
ALGORITHM (INVERTED INDEX CONSTRUCTION)
1. The function receives the list containing the documents.
2. Then iterate over this document list, one document at a time.
3. The words contained in each document are extracted into a list using a simple split() call.
Ideally the NLTK library should be used, as it gives a more syntactically and semantically
accurate list of words.
4. The words of each document are then iterated through; each word is paired with the
document number to form a tuple, which is appended to words_list[].
5. This produces a global list of (term, document_id) tuples.
6. This list is then converted to a set to remove duplicate elements.
7. The set is then sorted alphabetically into a new list.
8. After this a dictionary (hash_map) is created; the term serves as the key and the value is a
tuple containing the postings list of documents and the frequency.
9. The postings list grows dynamically; a linked list could also be used if extra per-posting
information is needed.
10. The textbook suggests storing the hash map values on hard disk, as the postings tend to
become a very large file.
11. In this small use case, however, the lists can be kept in memory.
12. The dictionary is created by iterating through all the tuples in the word list.
13. Each time a term is encountered, the associated document id is appended to that term's
postings list and the frequency value is incremented.
14. Finally, each term's postings list is sorted. (A short sketch of these steps is given below.)
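A minimal Python sketch of the steps above, assuming the documents are already available as strings in a list. All names here (build_inverted_index, documents, hash_map, the sample docs) are illustrative, not the exact submitted code:

def build_inverted_index(documents):
    # Steps 2-5: collect (term, document_id) tuples from every document
    word_list = []
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():   # simple split(); NLTK would tokenise more accurately
            word_list.append((term, doc_id))
    # Steps 6-7: remove duplicates and sort alphabetically
    word_list = sorted(set(word_list))
    # Steps 8-13: term -> (postings list, document frequency)
    hash_map = {}
    for term, doc_id in word_list:
        postings, freq = hash_map.get(term, ([], 0))
        postings.append(doc_id)
        hash_map[term] = (postings, freq + 1)
    # Step 14: keep each postings list sorted
    for term, (postings, freq) in hash_map.items():
        hash_map[term] = (sorted(postings), freq)
    return hash_map

docs = ["web mining lab", "text mining and web search", "unix grep command"]
print(build_inverted_index(docs))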
ALGORITHM (QUERY PROCESSING)
1. The query is taken as input.
2. The meaning of the query is interpreted manually.
3. In future, the query text can be parsed into its appropriate meaning automatically; for
example, AND can be understood as the bitwise '&'. This would allow this part to be
encapsulated as a modular function.
4. Using the term-document incidence matrix, the list of 0/1 values associated with each
query term is passed to a function.
5. Inside the function, the list is iterated over to convert it into a suitable integer (a bit vector).
6. The bitwise operation dictated by the query is performed on the returned values.
7. After this, the result is converted back to a binary string, with leading-zero padding in case
some of the initial positions are 0.
8. This string is then passed to a function, where each index holding the value 1 corresponds
to the associated document satisfying the query.
9. Finally, all the relevant documents fulfilling the condition are printed. (A short sketch of
these steps is given below.)
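A rough Python sketch of steps 4-9, assuming the term-document incidence matrix is held as a dictionary mapping each term to its row of 0/1 values (one per document). The function names and the tiny example matrix are purely illustrative:

def row_to_int(row):
    # Steps 4-5: convert a term's 0/1 incidence row into an integer bit vector
    return int("".join(str(bit) for bit in row), 2)

def and_query(term_a, term_b, incidence, num_docs):
    # Step 6: bitwise AND of the two bit vectors (query "term_a AND term_b")
    result = row_to_int(incidence[term_a]) & row_to_int(incidence[term_b])
    # Step 7: back to a binary string, padded with leading zeros
    bits = format(result, "0{}b".format(num_docs))
    # Steps 8-9: every index holding '1' is a document that satisfies the query
    return [doc_id for doc_id, bit in enumerate(bits) if bit == "1"]

incidence = {"web": [1, 0, 1], "mining": [1, 1, 0]}
print(and_query("web", "mining", incidence, 3))   # -> [0]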
CODE AND OUTPUTS :-