Bits Pilani, Dubai Campus


CS F469 Information Retrieval Question Paper 07.06.2020
BITS PILANI, DUBAI CAMPUS
Dubai International Academic City
Second Semester 2019 – 2020
Comprehensive Examination (Closed Book) (5 Pages)
Year : B.E. Date: 07.06.2020 FN
Course No : CS F469 Max Marks: 40 (40%)
Course Title : Information Retrieval Duration: 3 Hours
Part A MCQs (10 * 0.5 = 5 M)
Answer all questions
1) Each posting in a positional index is:
A) a position-ID and a list of bi-words
B) a doc-ID and a list of positions
C) a position-ID and a list of documents
D) a doc-ID and a list of permutation terms

2) The Soundex code for the input string BY is given by:
A) B000
B) B001
C) B100
D) None of the above.

3)

4) In the URL Frontier of the Mercator scheme, the Back Queues:
A) Handle Connectivity Queries B) Manage Prioritization
C) Enforce Politeness D) Deploy Gap Encoding

5) The Recommender System that uses one technique to generate an output, which in turn is
used as an input to the second recommendation technique, is called:
A) Parallelized Hybridization B) Switching
C) Feature Augmentation D) None of the above

6) A Topic Model discovers topics across various text documents. It deploys:
A) backtracking technique B) supervised learning technique
C) dynamic programming technique D) unsupervised learning technique

7) A good clustering method will produce high quality clusters with:
A) high intra-class similarity & low inter-class similarity
B) low intra-class similarity & high inter-class similarity
C) low intra-class similarity & low inter-class similarity
D) high intra-class similarity & high inter-class similarity

8) SVMs are less effective when:
A) The data are linearly separable
B) The data are noisy and contain overlapping points
C) The data are clean and ready to use
D) None of the above

9) The Naive Bayes Algorithm for Text Classification uses _____________ to make class
predictions.
A) distributed architecture B) differential equations
C) conditional probability D) pseudo random numbers

10) Consider the following lexicalized subtrees for a query (q4) and a document (d2), in the
context of XML Retrieval.

The context resemblance function is equal to:
A) 0.25 B) 0.75
C) 0.50 D) None of the above.


Part B Short Answers. (5 * 2 = 10 M)


Answer all questions.
1. Consider the following 2 disk blocks having the following posting lists:

Disk Block 1                      Disk Block 2
Term     Doc IDs                  Term     Doc IDs
Brain    D1, D3                   Brain    D6, D7
Cap      D1, D2, D4               Cap      D8, D10
North    D5                       Jar      D9
Wide     D1, D2, D4, D5           King     D8

Now, write down the merged postings lists using the Block Sort Based Indexing algorithm (BSBI).
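
The merge step of BSBI can be sketched as follows. This is a minimal illustration, assuming each block is already sorted by term and held in memory as a list of (term, doc IDs) pairs; the names `merge_blocks` and `doc_key` are hypothetical, and a real BSBI merge would stream the blocks from disk rather than hold them in lists.

```python
import heapq
from itertools import groupby


def doc_key(doc_id):
    # Compare doc IDs by their numeric part so that "D10" sorts after "D9".
    return int(doc_id[1:])


def merge_blocks(*blocks):
    """Merge term-sorted blocks of (term, doc_ids) into one sorted index."""
    # heapq.merge walks the already-sorted blocks in overall term order.
    merged = heapq.merge(*blocks, key=lambda entry: entry[0])
    index = []
    for term, group in groupby(merged, key=lambda entry: entry[0]):
        postings = set()
        for _, doc_ids in group:
            postings.update(doc_ids)
        index.append((term, sorted(postings, key=doc_key)))
    return index


block1 = [("Brain", ["D1", "D3"]), ("Cap", ["D1", "D2", "D4"]),
          ("North", ["D5"]), ("Wide", ["D1", "D2", "D4", "D5"])]
block2 = [("Brain", ["D6", "D7"]), ("Cap", ["D8", "D10"]),
          ("Jar", ["D9"]), ("King", ["D8"])]

merged_index = merge_blocks(block1, block2)
for term, postings in merged_index:
    print(term, postings)
```

Terms that occur in both blocks (Brain, Cap) end up with one concatenated postings list, while the remaining terms are carried over unchanged in dictionary order.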

2. Compute the measures PRECISION and RECALL for the following IR application involving text documents:

Number of Documents
                 RELEVANT    NON-RELEVANT
RETRIEVED        60          70
NOT RETRIEVED    160         810
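
The two measures follow directly from the contingency table. The sketch below uses the standard definitions (precision = relevant retrieved / all retrieved, recall = relevant retrieved / all relevant); the function name `precision_recall` is only illustrative.

```python
def precision_recall(tp, fp, fn):
    """tp: relevant and retrieved, fp: non-relevant but retrieved,
    fn: relevant but not retrieved."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


precision, recall = precision_recall(tp=60, fp=70, fn=160)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
# prints "precision = 0.462, recall = 0.273"
```

The 810 non-relevant, non-retrieved documents do not enter either measure.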

3. Convert the decimal number 525 to an equivalent Gamma code.
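
Gamma coding can be sketched as below, assuming the usual textbook convention: the offset is the binary form of the number with its leading 1 removed, and the length of the offset is written in unary (that many 1s followed by a 0). The helper name `gamma_encode` is hypothetical.

```python
def gamma_encode(n):
    """Return the Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]                # e.g. 13 -> "1101"
    offset = binary[1:]                # drop the leading 1
    length = "1" * len(offset) + "0"   # unary code of the offset length
    return length + offset


code = gamma_encode(525)
print(code)  # prints "1111111110000001101"
```

Here 525 is 1000001101 in binary, so the offset is 000001101 (9 bits) and the length part is nine 1s followed by a 0.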

4. In Cross Language Information Retrieval, what are the different types of Bilingual Corpora?

5. Discuss in brief the dimensionality reduction technique “Missing Value Ratio”. Illustrate with an example.

Part C (Descriptive, Numerical, Application of Concepts). (5 * 5 = 25 M)


Answer all questions.

1. Draw the inverted index that would be built for the following document collection:

Doc 1 PIN FIND BEAR BISON
Doc 2 BEAR BISON MAN
Doc 3 MAN PIN FIND BEAR
Doc 4 PIN FIND MAN BISON BEAR
Doc 5 PIN FIND BEAR
Doc 6 MAN BISON MAN BISON
Doc 7 BEAR MAN BISON
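
One way to check a hand-drawn inverted index is a short script like the one below. It is a sketch only: it assumes whitespace tokenization, no stemming or stop-word removal, and reads Doc 7 as the three terms BEAR MAN BISON; `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

docs = {
    1: "PIN FIND BEAR BISON",
    2: "BEAR BISON MAN",
    3: "MAN PIN FIND BEAR",
    4: "PIN FIND MAN BISON BEAR",
    5: "PIN FIND BEAR",
    6: "MAN BISON MAN BISON",
    7: "BEAR MAN BISON",  # assumption: MAN and BISON are separate terms
}


def build_inverted_index(collection):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.split():
            postings[term].add(doc_id)  # a set ignores repeats within a doc
    # Dictionary in alphabetical order, postings in increasing doc ID order.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(docs)
for term, ids in index.items():
    print(f"{term:<6} df={len(ids)} -> {ids}")
```

Each dictionary entry carries its document frequency, and repeated terms inside one document (as in Doc 6) contribute a single posting.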


2. MULTIMEDIA INFORMATION RETRIEVAL FROM A DISTRIBUTED MULTIMEDIA DATABASE SYSTEM

It is proposed to develop a Multimedia Information Retrieval System for an international university. The
university is spread over several continents. Each continent has several countries, and each country has
several major cities. The university has a branch campus in every major city across the world. The
university intends to make course content accessible to all its students across the world. The course
content is multimedia (text, image, video, audio, graphical), and faculty members from all
branch campuses are involved in creating it. Each branch campus hosts content for around
50 courses (there is no overlap). Each course has around 40 topics. For example, “Vector Space Model” can
be a topic in the course “Information Retrieval”. A student from any branch campus can retrieve multimedia
information on any specific topic, from one or more branch campuses, based on his/her query. The query
can involve retrieval of data from one or more media types (such as audio, image, video, text and graphical).

Answer the following questions.

1. Write a sample query (in simple English) to search the database by an input image. (Please note: you are
searching by content, not by keyword.)
2. Write a sample query that takes a keyword as input and retrieves two media type objects
corresponding to the keyword.
3. Write a sample query that takes a keyword as input and retrieves any three media type objects
corresponding to the keyword.
4. Write a heterogeneous multimedia query that uses the principle of “Mix and Match data from three
sources – text, image and video”.

3. ONLINE BOOKSTORE – Books Recommender System

It is proposed to develop a Books Recommender System for an online bookstore that allows a customer
to purchase / order books online or over the telephone.

The application administrator is responsible for design and maintenance of the above system.
The following aspects are to be considered for the above system:

• User Interface (can be web interface / phone interface)
• Publishers / Suppliers Module
• Shippers / Shipping Facilities
• BOOKSTORE’s MAIN OFFICE
• PAYMENT PROCESSING
• CUSTOMER HISTORY
• BOOKS INVENTORY

As an innovative IR designer, you are required to perform the following:

a) What recommendation approach(es) will you follow for the above Books Recommender System?
Justify your answer.
b) Write down the steps involved in building the above recommender system. (A diagram is not needed; just
write the steps in plain English sentences.)

4. Consider a very small collection C that consists of the following three documents:
• d1: “WEST TRY TIN POT”
• d2: “WEST POT TIN FIN”
• d3: “WEST TRY POT TIN”
Given the following query: “POT TIN WEST OIL”.
Compute the cosine similarity values between each document and the query.
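
A sketch of the computation follows. The question does not fix a term weighting, so this example assumes raw term-frequency vectors (no idf) and reads each document and the query as four separate words; the helper names are illustrative.

```python
import math
from collections import Counter

docs = {
    "d1": "WEST TRY TIN POT",
    "d2": "WEST POT TIN FIN",
    "d3": "WEST TRY POT TIN",
}
query = "POT TIN WEST OIL"


def tf_vector(text):
    # Raw term counts; an assumption, since no weighting scheme is given.
    return Counter(text.split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


q_vec = tf_vector(query)
scores = {name: cosine(tf_vector(text), q_vec) for name, text in docs.items()}
for name, score in scores.items():
    print(f"cos({name}, q) = {score:.2f}")
```

Under this weighting every document shares three of its four terms with the query, so each score is 3 / (2 × 2) = 0.75; a tf-idf weighting would separate the documents.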

5. a) What are the key functions of SEO software?

b) Jaccard with Shingles
Consider a very small collection C that consists of the following two documents:

D21 : DAY GOLD TEA CUPS TIN

D22 : CUPS GOLD TEA CUPS SUN

Consider the (k = 2)-shingles for each document D21, D22.

Compute the Jaccard similarity for the document pair [D21, D22].
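
The shingle sets and their Jaccard similarity can be sketched as below. The example assumes word-level shingles (two consecutive words each); character-level shingles would give a different result. `word_shingles` and `jaccard` are hypothetical helper names.

```python
def word_shingles(text, k=2):
    """Return the set of k consecutive-word shingles of a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s21 = word_shingles("DAY GOLD TEA CUPS TIN")
s22 = word_shingles("CUPS GOLD TEA CUPS SUN")
similarity = jaccard(s21, s22)
print(sorted(s21 & s22), f"{similarity:.3f}")
```

With word shingles the two documents share (GOLD, TEA) and (TEA, CUPS) out of six distinct shingles, giving 2/6 = 1/3.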

=========================================================

Question paper - CS F469 Information Retrieval

=============================================
BITS Pilani, Dubai Campus, Academic City, Dubai
II Semester 2019-2020
Degree: B.E. Hons. TEST 2 Question Paper
Course No : CS F469 Course Title: Information Retrieval
Date: 08.04.2020 Wednesday Time: 8.30-9.20 am Total Marks: 20 Weightage: 20%
Data provided are complete. Open Book.
This question paper has 5 questions in 4 pages.
===============================================================
Answer all questions.

1. How will you evaluate any Recommender System? [2 M]

2. Let N = 10,000,000 represent the number of documents in the collection.
df_t is the document frequency, the number of documents that the term t occurs in.
idf_t is a measure of the informativeness of the term t.
Compute idf_t for each of the following terms in the table given below (fill in the
blank column):

Term         df_t         idf_t
calpurnia    1
animal       109
sunday       1009
fly          10,009
under        100,009
the          1,000,009

[3 M]
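
The blank column can be filled in with a few lines of code. This sketch assumes the common definition idf_t = log10(N / df_t) and reads the collection size as N = 10,000,000 (the eight digits printed in the question); both are assumptions rather than something the paper states explicitly.

```python
import math

N = 10_000_000  # assumption: collection size as read from the question

doc_freq = {
    "calpurnia": 1,
    "animal": 109,
    "sunday": 1_009,
    "fly": 10_009,
    "under": 100_009,
    "the": 1_000_009,
}


def idf(df, n_docs=N):
    # Base-10 logarithm of the inverse document frequency ratio.
    return math.log10(n_docs / df)


idf_values = {term: idf(df) for term, df in doc_freq.items()}
for term, df in doc_freq.items():
    print(f"{term:<10} {df:>10,} {idf_values[term]:.3f}")
```

Rare terms such as calpurnia get the largest idf, while a term that occurs in about a tenth of the collection, such as "the" here, gets an idf close to 1.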


3. Three computers, A, B, and C, have the numerical features listed below:

Feature             A       B       C
Processor Speed     3.15    2.58    2.52
Disk Size           500     200     540
Main-Memory Size    7       4       5

We may imagine these values as defining a vector for each computer; for instance,
A’s vector is [3.15, 500, 7]. We can compute the cosine distance between
any two of the vectors, but if we do not scale the components, then the disk
size will dominate and make differences in the other components essentially
invisible. Let us use 1 as the scale factor for processor speed, α for the disk
size, and β for the main memory size.

(a) In terms of α and β, compute the cosines of the angles between the
vectors for each pair of the three computers.
(b) What are the cosines of the angles between the vectors if α = 0.01 and
β = 0.5? [3+2 M]
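
For part (b), the scaled vectors and their pairwise cosines can be computed as sketched below. After scaling, A becomes [3.15, 500α, 7β], so for example the dot product of A and B is 8.127 + 100000α² + 28β². The function names are illustrative only.

```python
import math
from itertools import combinations

computers = {
    "A": (3.15, 500, 7),
    "B": (2.58, 200, 4),
    "C": (2.52, 540, 5),
}


def scaled(vec, alpha, beta):
    speed, disk, memory = vec
    return (speed, alpha * disk, beta * memory)  # scale factor 1 for speed


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def pairwise_cosines(alpha, beta):
    vecs = {name: scaled(v, alpha, beta) for name, v in computers.items()}
    return {(p, q): cosine(vecs[p], vecs[q])
            for p, q in combinations(sorted(vecs), 2)}


result = pairwise_cosines(alpha=0.01, beta=0.5)
for pair, value in result.items():
    print(pair, round(value, 4))
```

Calling `pairwise_cosines(1, 1)` shows the unscaled case, where the disk size dominates and all three cosines are very close to 1.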


4. Construct Table II for the dictionary-based LZW compression algorithm as shown
below (the algorithm need not be written; only the table entries are to be filled in for
successive steps as necessary).
Let the STRING TABLE (dictionary) initially contain only 3 characters, with
codes as shown in Table 1.

Table 1
Code    String
1       A
2       B
3       C

If the input string is ABAABBAACCBB, write the output codes for it.

TABLE II
s    c    Output    Code    String
                    1       A
                    2       B
                    3       C

(Extend this table with as many rows as needed) [5 M]
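
The table rows can be generated with a short LZW encoder. This is a sketch of the standard algorithm (s is the current string, c the next character; when s + c is not in the dictionary, the code of s is output and s + c is added with the next free code); `lzw_encode` is a hypothetical name.

```python
def lzw_encode(text, initial_table):
    """Return (output codes, table rows) for LZW encoding of text.

    Each row is (s, c, output code, new code, new string)."""
    if not text:
        return [], []
    table = dict(initial_table)          # string -> code
    next_code = max(table.values()) + 1
    s = text[0]
    output, rows = [], []
    for c in text[1:]:
        if s + c in table:
            s += c                       # keep extending the current match
        else:
            output.append(table[s])
            table[s + c] = next_code
            rows.append((s, c, table[s], next_code, s + c))
            next_code += 1
            s = c
    output.append(table[s])              # flush the final match
    return output, rows


codes, rows = lzw_encode("ABAABBAACCBB", {"A": 1, "B": 2, "C": 3})
for row in rows:
    print(*row)
print("output:", codes)  # output: [1, 2, 1, 4, 5, 1, 3, 3, 2, 2]
```

The new dictionary entries created along the way are AB=4, BA=5, AA=6, ABB=7, BAA=8, AC=9, CC=10, CB=11 and BB=12.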


5. Compute the page rank for the given scenario iteratively (perform 4 iterations) using
Google's original page rank algorithm.
A, B, C and D refer to 4 web pages. Assume that the damping factor d is 0.72. [5 M]

Iteration      PR(A)    PR(B)    PR(C)    PR(D)
0 (initial)    1        1        1        1
(you perform 4 iterations; consider 3 digits after the decimal point in results)
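
The iteration can be sketched as below using the original formula PR(p) = (1 − d) + d · Σ PR(q) / C(q), summed over pages q that link to p, where C(q) is the number of outgoing links of q. The link structure between A, B, C and D is not given in the text here, so the `links` dictionary is purely hypothetical; the sketch also assumes that every page is updated from the previous iteration's values.

```python
def pagerank(links, d=0.72, iterations=4):
    """Iterate the original (non-normalized) PageRank formula.

    links maps each page to the list of pages it links to."""
    pages = sorted(links)
    pr = {p: 1.0 for p in pages}          # iteration 0
    history = [dict(pr)]
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new_pr[p] = (1 - d) + d * incoming
        pr = new_pr                        # synchronous update
        history.append(dict(pr))
    return history


# Hypothetical link graph, for illustration only.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

history = pagerank(links)
for i, ranks in enumerate(history):
    print(i, {p: round(v, 3) for p, v in ranks.items()})
```

Replacing `links` with the actual graph from the question yields the four rows to enter in the table; updating pages in place within an iteration, as some treatments do, gives slightly different intermediate values.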

****************

IR-Question Paper
BITS PILANI, DUBAI CAMPUS
Dubai International Academic City
Second Semester 2019 – 2020
TEST 1 (Closed Book) (seven questions)
Year : III/IV Date: 26.02.2020 W1
Course No : CS F469 Max Marks: 20 (20%)
Course Title : Information Retrieval Duration: 50 minutes
Answer all questions.
1. What is the difference between 'TOKEN' and 'TYPE' in the context of IR systems? [2 M]

2. a) What is the basic principle behind the Block Sort-Based Indexing (BSBI) scheme?
b) What is the time complexity of the Block Sort-Based Indexing scheme?
[2 M]
3. Apply the PORTER STEMMER to the following content and rewrite the final text (that will be
your output): [3 M]

lion coming helping west
climbed formalizing tested well
ignition communicate altered sing

4. Compute Soundex code of the following string: ACCOMMODATION [3 M]
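
A compact Soundex routine is sketched below. It assumes the common textbook variant: keep the first letter, map the remaining letters to digits (vowels and H, W, Y to 0), collapse runs of identical digits, drop the zeros, then pad with zeros to four characters. Other Soundex variants treat H and W differently, so this is one reasonable reading rather than the only one.

```python
GROUPS = [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
          ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]
CODES = {ch: digit for letters, digit in GROUPS for ch in letters}


def soundex(word):
    """Return the 4-character Soundex code of an alphabetic word."""
    word = word.upper()
    if not word:
        raise ValueError("empty word")
    digits = [CODES[ch] for ch in word[1:] if ch in CODES]
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)      # keep one of each run of equal digits
    tail = "".join(d for d in collapsed if d != "0")
    return (word[0] + tail + "000")[:4]  # pad with zeros, keep 4 characters


print(soundex("ACCOMMODATION"))  # prints "A253"
```

For ACCOMMODATION the digit string after the first letter is 220550303005, which collapses to 205030305 and, with zeros removed, to 25335; the first three digits give A253.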

5. Draw the inverted index that would be built for the following document collection: [4 M]

Doc 1 WIND COT KIT
Doc 2 KIT COT INK
Doc 3 RAN WIND COT
Doc 4 WIND SKY COT
Doc 5 WIND COT ORG
Doc 6 SKY INK COT
Doc 7 KIT INK SKY

6. a) Compute the Levenshtein distance matrix for the edit distance between the following two
strings: GREAT & CREATION. Assume that GREAT is the source string and CREATION is the target string.
[3 M]
6. b) Identify and list the operations (copy, replace, insert, delete, as applicable) by backtracking in the
matrix. [1 M]
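
Both parts can be checked with a small dynamic-programming sketch. It assumes unit cost for insert, delete and replace and zero cost for copy; when several optimal paths exist, the backtracking below prefers the diagonal move, which is just one valid choice. The function names are hypothetical.

```python
def edit_matrix(source, target):
    """Levenshtein matrix d, where d[i][j] = distance(source[:i], target[:j])."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # copy or replace
    return d


def backtrack(source, target, d):
    """Recover one optimal operation sequence from the filled matrix."""
    i, j = len(source), len(target)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                name = "copy" if cost == 0 else "replace"
                ops.append((name, source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
        else:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
    ops.reverse()
    return ops


matrix = edit_matrix("GREAT", "CREATION")
operations = backtrack("GREAT", "CREATION", matrix)
print("distance:", matrix[-1][-1])  # distance: 4
for op in operations:
    print(op)
```

The recovered sequence replaces G with C, copies R, E, A and T, and inserts I, O and N, for a total cost of 4.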

7. a) What are STOP WORDS in the context of Information Retrieval Systems?
b) Mention any two examples of stop words.
[2 M]

**********************

BITS PILANI, DUBAI CAMPUS
Dubai International Academic City
Second Semester 2019 – 2020
Quiz (Closed Book) (2 Pages)
Year : IV/III Date: 20.04.20
Course No : CS F469 Max Marks: 10 (10%)
Course Title : INFORMATION RETRIEVAL Duration: 20 minutes

ID NO:_____________________ Name:___________________________

1. Two web search engines A and B each generate a large number of pages uniformly at
random from their indexes. 40% of A’s pages are present in B’s index, while 60% of
B’s pages are present in A’s index. What is the number of pages in A’s index relative
to B’s? [1 M]

2. Why do web crawlers perform URL normalization? [1 M]

3. List any two benefits of a robots.txt file. [2 M]

4. Why should the host splitter precede the Duplicate URL Eliminator
in a distributed crawler? [2 M]

5. What are the approaches to sampling URLs in web search? [2 M]

6. List any two benefits of an XML / HTML Sitemap w.r.t. SEO
(search engine optimization). [2 M]
