Faculty of Engineering, Environment and Computing 7071CEM Assignment Brief Jan-May 2021



Module Title: Information Retrieval
Ind/Group: Individual
Cohort: Jan-May
Module Code: 7071CEM
Coursework Title (e.g. CWK1): CW
Hand out date: 03.03.2021
Lecturer: Seyed Mousavi
Due date: 02.04.2021 (plus the university automatic two-week extension)
Estimated Time (hrs): 50
Coursework type: Project/Development
% of Module Mark: 66.7
Word Limit*: N/A

Submission arrangement online via Aula: x


File types and method of recording: pdf or docx
Mark and Feedback date: 10 days after submission
Mark and Feedback method: via Aula

Module Learning Outcomes Assessed:

Module Learning Outcomes 1-5:

1. Demonstrate a sound knowledge of information retrieval principles


2. Apply main data structures used in index construction in Python or a similar high-level
language
3. Implement a typical web crawler and query processor, in Python or a similar high-level
language
4. Acquire knowledge and skills to apply common machine learning methods for text
classification and document clustering
5. Build the outline of a minimum viable vertical search engine for text retrieval

Task:

Develop a vertical search engine similar to Google Scholar that only retrieves papers/books
published by a member of Coventry University. That is, at least one of the co-authors must be
from CU. To that end, you crawl Google Scholar profiles of academic staff at CU and index their
papers in their profiles. The seed page for your crawler, i.e. the first page to crawl, is the Google
Scholar page for Coventry University:
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779

Your system crawls this page and follows the link provided for each member of staff to reach
their Google Scholar profile. Then, for each profile, it goes through the publications and
constructs the inverted index from the information about those publications. Because this
information changes slowly, your crawler may be scheduled to look for new information, say,
once per week, but it should ideally be able to do so automatically, as a scheduled task.
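
As an illustration only, the breadth-first traversal and the weekly re-crawl check could be
sketched as follows. The names bfs_crawl, recrawl_due, fetch_links and the toy link graph are
assumptions made for this example and are not part of the brief; a real crawler would replace
toy_fetch_links with code that downloads a page and extracts its links.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Callable, Iterable, List


def bfs_crawl(seed: str,
              fetch_links: Callable[[str], Iterable[str]],
              max_pages: int = 1000) -> List[str]:
    """Visit pages breadth-first from `seed` and return URLs in visit order."""
    visited = {seed}       # URLs already queued, so no page is visited twice
    queue = deque([seed])  # FIFO queue is what makes the traversal breadth-first
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


def recrawl_due(last_crawl: datetime, now: datetime,
                interval: timedelta = timedelta(weeks=1)) -> bool:
    """Return True when at least `interval` has passed since the last crawl."""
    return now - last_crawl >= interval


# Hypothetical link graph standing in for real pages (not real URLs).
TOY_WEB = {
    "org": ["staff/a", "staff/b"],
    "staff/a": ["paper/1", "paper/2"],
    "staff/b": ["paper/2", "paper/3"],
}


def toy_fetch_links(url: str) -> List[str]:
    """Stand-in for downloading a page and extracting its links."""
    return TOY_WEB.get(url, [])


print(bfs_crawl("org", toy_fetch_links))
```

In this toy graph the organisation page is visited first, then both staff profiles, then the
papers, and the paper linked from both profiles is visited only once.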

From the user’s point of view, your system has an interface that is similar to the Google Scholar
main page, where the user can type in their queries/keywords about the resources they want
to find. Then, your system will display the results, sorted by relevance, in a similar way to
Google Scholar. However, only publications with at least one co-author from CU are retrieved.
You may further specialise your search engine to a specific field, e.g., computer science,
mechanical engineering, bioinformatics or whatever you would like.
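
One possible shape for the index and the relevance ranking is sketched below, as an assumption
rather than a required design. The class InvertedIndex, its methods, the smoothed TF-IDF
weighting and the sample titles are all invented for this example; other data structures and
ranking functions studied in the module would be equally valid.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        # term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)
        self.docs = {}  # doc_id -> indexed text (e.g. a publication title)

    def add_document(self, doc_id, text):
        """Index one new document; assumes doc_id has not been indexed before."""
        self.docs[doc_id] = text
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return doc_ids ranked by summed TF-IDF score, best match first."""
        scores = defaultdict(float)
        n_docs = len(self.docs)
        for term in tokenize(query):
            posting = self.postings.get(term, {})
            if not posting:
                continue  # term does not occur in any indexed document
            # Smoothed IDF (an assumption): stays positive for very common terms
            idf = math.log(1 + n_docs / len(posting))
            for doc_id, tf in posting.items():
                scores[doc_id] += tf * idf
        # Ties are broken by doc_id so the ordering is deterministic
        return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical publication titles, used only to exercise the sketch.
index = InvertedIndex()
index.add_document(1, "Deep learning for medical image retrieval")
index.add_document(2, "Information retrieval with inverted index structures")
index.add_document(3, "Thermal analysis of engine components")

print(index.search("image retrieval"))
```

Because add_document can be called at any time, the same structure can be extended whenever the
crawler delivers new publications, without rebuilding the index from scratch.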

In addition, whether as a separate program or integrated with the search engine, a subject
classification function is needed. More specifically, the input is a scientific text and the
output is its subject: zero or more of Health, Engineering, Business and Art.
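
Because a text may belong to zero or more subjects, one way to frame the task is as multi-label
classification with one binary decision per subject. The sketch below is a minimal illustration
under that assumption: it trains one small Naive Bayes model per subject (one-vs-rest) on a
handful of made-up training sentences. The class SubjectClassifier, the training data and the
choice of Naive Bayes are all assumptions for the example, and it assumes every subject has at
least one positive and one negative training example.

```python
import math
import re
from collections import Counter

LABELS = ["Health", "Engineering", "Business", "Art"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


class SubjectClassifier:
    """One binary Naive Bayes model per subject (one-vs-rest)."""

    def fit(self, examples):
        # examples: list of (text, set_of_labels)
        self.vocab = {t for text, _ in examples for t in tokenize(text)}
        self.models = {}
        for label in LABELS:
            counts = {True: Counter(), False: Counter()}  # term counts per side
            n_docs = {True: 0, False: 0}                  # document counts per side
            for text, labels in examples:
                side = label in labels
                counts[side].update(tokenize(text))
                n_docs[side] += 1
            self.models[label] = (counts, n_docs)
        return self

    def _log_score(self, tokens, counts, n_docs, side):
        total_tokens = sum(counts[side].values())
        score = math.log(n_docs[side] / (n_docs[True] + n_docs[False]))  # prior
        for t in tokens:
            # Laplace (add-one) smoothing over the training vocabulary
            score += math.log((counts[side][t] + 1) / (total_tokens + len(self.vocab)))
        return score

    def predict(self, text):
        """Return the set of subjects whose positive model outscores the negative."""
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        result = set()
        for label, (counts, n_docs) in self.models.items():
            pos = self._log_score(tokens, counts, n_docs, True)
            neg = self._log_score(tokens, counts, n_docs, False)
            if pos > neg:
                result.add(label)
        return result


# Made-up training sentences, far too few for real use.
TRAIN = [
    ("patient clinical treatment disease", {"Health"}),
    ("hospital patient health care", {"Health"}),
    ("engine design mechanical stress", {"Engineering"}),
    ("control system design sensor", {"Engineering"}),
    ("market finance management strategy", {"Business"}),
    ("customer market business revenue", {"Business"}),
    ("painting sculpture gallery exhibition", {"Art"}),
    ("music theatre performance art", {"Art"}),
]

clf = SubjectClassifier().fit(TRAIN)
print(clf.predict("Clinical treatment of patient disease in hospital"))
```

With this toy data, a text containing no known vocabulary receives no subject at all, and a text
mixing health and business terms can receive both subjects, which matches the "zero or more"
requirement. A real system would need a much larger labelled corpus.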

You can use any general-purpose programming language of your choice, although Python is
recommended because of its rich libraries and the sample code developed in the labs.

In case of ambiguity, make reasonable assumptions and/or let me know.

Requirements and Markings


Requirement: Mark %

Fully working crawler component: 25
Must completely crawl the CU profile in Google Scholar and subsequent links in BFS manner. It
should be scheduled to re-crawl to extract new data automatically.

Construction of inverted index: 15
Construction of the index based on appropriate data structures studied in the module as opposed
to naive database tables. The index should be updated once new data are received from the
crawler component.

Fully working query processor component: 25
Displaying results relevant to given queries

Fully working subject classification component: 20
Given the search keyword/text, the system should identify its relevant subject(s), if any, among
the predefined subjects.

Overall usability: 15
Acceptable response time, accuracy of results, nice interface, readable results sorted by
relevance, subject classification and anything else that might affect the usability of the product.

Based on the above marking scheme, a typical system expected for a mark of 40 or more is a
working search engine which accepts users’ queries/keywords and displays relevant results, i.e.
publications by CU staff in relevant fields. However, it may lack proper inverted index and
subject classification components, and may be slow and inaccurate in some cases.

To earn 70 or more, the system is expected to be a working search engine with reasonable
accuracy and speed. This ensures that the system contains fully working crawler and query
processor components. In addition, it must have at least one, and preferably both, of the other
two components, i.e. the inverted index and the text classification components, in fully working
status.

Please note that to show that your system meets each of the above-mentioned requirements,
your report must provide sufficient evidence including clear description, complete source code,
and complete screenshots where applicable.

Notes:  

1. You are expected to use the Coventry University APA style for referencing. For support and
advice on this, students can contact the Centre for Academic Writing (CAW).
2. Please notify your registry course support team and module leader for disability support.  
3. Any student requiring an extension or deferral should follow the university process as outlined
here.   
4. The University cannot take responsibility for any coursework lost or corrupted on disks, laptops
or personal computers. Students should therefore regularly back up any work and are advised to
save it on the University system.
5. If there are technical or performance issues that prevent students submitting coursework
through the online coursework submission system on the day of a coursework deadline, an
appropriate extension to the coursework submission deadline will be agreed. This extension will
normally be 24 hours or the next working day if the deadline falls on a Friday or over the
weekend period. This will be communicated via your Module Leader.  
6. You are encouraged to check the originality of your work by using the draft Turnitin links on
Aula.  
7. Collusion between students (where sections of your work are similar to the work submitted by
other students in this or previous module cohorts) is taken extremely seriously and will be
reported to the academic conduct panel. This applies to both courseworks and exam answers.  
8. A marked difference between your writing style, knowledge and skill level demonstrated in class
discussion, any test conditions and that demonstrated in a coursework assignment may result in
you having to undertake a Viva Voce in order to prove the coursework assignment is entirely
your own work.  
9. If you make use of the services of a proof reader in your work you must keep your original
version and make it available as a demonstration of your written efforts.   
10. You must not submit work for assessment that you have already submitted (partially or in full),
either for your current course or for another qualification of this university, with the exception
of resits, where for the coursework, you may be asked to rework and improve a previous
attempt. This requirement will be specifically detailed in your assignment brief or specific
course or module information. Where earlier work by you is citable, i.e. it has already been
published/submitted, you must reference it clearly. Identical pieces of work submitted
concurrently may also be considered to be self-plagiarism.

Mark allocation guidelines to students


0-39: Work mainly incomplete and/or weaknesses in most areas
40-49: Most elements completed; weaknesses outweigh strengths
50-59: Most elements are strong, minor weaknesses
60-69: Strengths in all elements
70+: Most work exceeds the standard expected
80+: All work substantially exceeds the standard expected
