GBX9MO23 - Information Access and Retrieval 2021-2022

Available from Master 2 MOSIG and MSIAM.

Course Description

This course is given by Jean-Pierre Chevallet, Philippe Mulhem, Lorraine Goeuriot and Georges Quénot from the Multimedia Information Modeling and Retrieval (MRIM) research group of the Grenoble Informatics Laboratory (LIG).

Contact: Georges.Quenot@imag.fr

Contents / schedule:

Part I. Foundations of Information Retrieval (Jean-Pierre Chevallet and Philippe Mulhem)

Part II: Web, social networks and health (Philippe Mulhem, Lorraine Goeuriot)

Part III: Multimedia indexing and retrieval (Georges Quénot)

Reference to IR books or papers

Practical

The goal of this practical is to implement in Python a skeleton of an IR system. The fundamental structure to implement is an inverted file that contains, for each term, the internal Id (integer) of each document in which the term occurs and the frequency (integer) of the term in that document.

You must accept the following implementation constraints to reduce the size of this inverted file:

  • Document Id stored as a 4-byte unsigned integer (32 bits, at most about 4 billion indexed documents)
  • Frequency stored as a 1-byte unsigned integer (8 bits, maximum frequency of 255)

For that, use Python's basic and efficient array structure. This is the only imposed data structure for this project.
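As an illustration of these constraints, here is a minimal sketch that assumes the "array structure" means the standard library array module. The names PostingList, build_index and MAX_FREQ are hypothetical, and clamping frequencies at 255 is one possible choice for handling values that do not fit in a byte, not a requirement of the assignment.

```python
from array import array
from collections import Counter

MAX_FREQ = 255  # largest value that fits in one unsigned byte


class PostingList:
    """Postings of one term: parallel arrays of document ids and frequencies."""

    def __init__(self):
        # "I" is an unsigned int: 4 bytes on common platforms (check .itemsize).
        self.doc_ids = array("I")
        # "B" is an unsigned char: always 1 byte.
        self.freqs = array("B")

    def add(self, doc_id, freq):
        self.doc_ids.append(doc_id)
        # Clamp rather than overflow (assumption: saturating at 255 is acceptable).
        self.freqs.append(min(freq, MAX_FREQ))


def build_index(docs):
    """docs is a list of token lists; the list position is the internal doc id."""
    index = {}
    for doc_id, tokens in enumerate(docs):
        for term, freq in Counter(tokens).items():
            index.setdefault(term, PostingList()).add(doc_id, freq)
    return index


docs = [["web", "search", "web"], ["search", "engine"], ["web"] * 300]
index = build_index(docs)
print(list(index["web"].doc_ids), list(index["web"].freqs))  # [0, 2] [2, 255]
```

Keeping document ids and frequencies in two parallel arrays avoids per-posting Python objects, so each posting costs about 5 bytes on platforms where "I" is 4 bytes wide.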

The inverted file can be loaded into memory for more efficient searching. Besides that, you have to store the external document Ids in a sequential structure that can be saved. The dictionary should also be programmed using a basic Python structure that can be saved.
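One possible way to make these three structures persistent is sketched below. The layout is an assumption, not part of the assignment: external Ids are kept in a list indexed by internal Id, the dictionary maps each term to an [offset, length] pair into two concatenated posting arrays, and the file names and the save and load helpers are hypothetical.

```python
import json
import os
import tempfile
from array import array


def save(directory, external_ids, dictionary, doc_ids, freqs):
    """Write the list and dict as JSON, and the posting arrays as raw bytes."""
    with open(os.path.join(directory, "meta.json"), "w", encoding="utf-8") as f:
        json.dump({"external_ids": external_ids, "dictionary": dictionary}, f)
    with open(os.path.join(directory, "doc_ids.bin"), "wb") as f:
        doc_ids.tofile(f)
    with open(os.path.join(directory, "freqs.bin"), "wb") as f:
        freqs.tofile(f)


def load(directory):
    """Read everything back into memory with the same typecodes."""
    with open(os.path.join(directory, "meta.json"), encoding="utf-8") as f:
        meta = json.load(f)
    doc_ids, freqs = array("I"), array("B")
    with open(os.path.join(directory, "doc_ids.bin"), "rb") as f:
        doc_ids.frombytes(f.read())
    with open(os.path.join(directory, "freqs.bin"), "rb") as f:
        freqs.frombytes(f.read())
    return meta["external_ids"], meta["dictionary"], doc_ids, freqs


# Tiny hand-built index: the position in external_ids is the internal doc id.
external_ids = ["doc-a.txt", "doc-b.txt"]
dictionary = {"search": [0, 2], "web": [2, 1]}  # term -> [offset, length]
doc_ids = array("I", [0, 1, 0])
freqs = array("B", [1, 1, 2])

with tempfile.TemporaryDirectory() as tmp:
    save(tmp, external_ids, dictionary, doc_ids, freqs)
    ext, dic, ids, frq = load(tmp)

start, length = dic["web"]
postings = zip(ids[start:start + length], frq[start:start + length])
print([(ext[d], f) for d, f in postings])  # [('doc-a.txt', 2)]
```

Raw array files are not portable between machines with different byte order or integer width, which is usually acceptable for a course project run on a single machine.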

Optionally, you can use the following libraries:

  • The nltk.corpus part of the Natural Language Toolkit for reading the corpus and for the stopwords
  • The nltk.stem module, for stemmers such as Snowball.
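A minimal sketch of how these optional libraries could fit into a text analysis step is given below. The analyze function and the regular-expression tokenizer are assumptions for illustration. The fallback branch, used when NLTK or its stopword data is not installed, relies on a tiny hand-picked stopword list and no stemming, so its output differs from the NLTK path.

```python
import re

try:
    from nltk.corpus import stopwords
    from nltk.stem import SnowballStemmer

    # The stopword list must be fetched once with nltk.download("stopwords").
    STOPWORDS = set(stopwords.words("english"))
    stem = SnowballStemmer("english").stem
except (ImportError, LookupError):
    # Fallback (assumption): a very small stopword list and no stemming.
    STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

    def stem(word):
        return word


def analyze(text):
    """Lowercase, tokenize on letters and digits, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


print(analyze("The indexing of documents is the first step"))
```

The same analyze function should be applied to both documents and queries, so that query terms match the terms stored in the dictionary.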

For the data set, you can use the following test collection (documents and solved queries):

The minimum matching models to implement are the Vector Space Model and one simple Language Model.
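To make the two models concrete, here is a hedged sketch of one common variant of each: tf-idf weighting with cosine similarity for the Vector Space Model, and query likelihood with Jelinek-Mercer smoothing for the Language Model. These variants, the function names and the smoothing parameter lam = 0.5 are assumptions, since the assignment does not impose a particular weighting. For brevity the sketch works on plain Counter objects rather than on the array-based inverted file.

```python
import math
from collections import Counter


def build_stats(docs):
    """Return per-document term frequencies, document and collection frequencies."""
    tfs = [Counter(tokens) for tokens in docs]
    df = Counter(term for tf in tfs for term in tf)
    cf = Counter(term for tokens in docs for term in tokens)
    return tfs, df, cf


def vsm_scores(query, tfs, df):
    """Cosine similarity between tf-idf vectors of the query and each document."""
    n_docs = len(tfs)
    idf = {t: math.log(n_docs / df[t]) for t in df}
    q_vec = {t: f * idf[t] for t, f in Counter(query).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    scores = []
    for tf in tfs:
        d_vec = {t: f * idf[t] for t, f in tf.items()}
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores


def lm_scores(query, tfs, cf, lam=0.5):
    """Log query likelihood, mixing document and collection models."""
    coll_len = sum(cf.values())
    scores = []
    for tf in tfs:
        doc_len = sum(tf.values())
        score = 0.0
        for t in query:
            if cf[t] == 0:
                continue  # term unseen in the whole collection: ignore it
            p_doc = tf[t] / doc_len if doc_len else 0.0
            score += math.log((1 - lam) * p_doc + lam * cf[t] / coll_len)
        scores.append(score)
    return scores


docs = [
    ["web", "search", "engine"],
    ["health", "search"],
    ["web", "web", "social", "network"],
]
tfs, df, cf = build_stats(docs)
query = ["web", "search"]
vsm = vsm_scores(query, tfs, df)
lm = lm_scores(query, tfs, cf)
print(sorted(range(len(docs)), key=lambda d: -vsm[d]))
print(sorted(range(len(docs)), key=lambda d: -lm[d]))
```

On this toy collection both models rank the first document highest, because it is the only one containing both query terms. The language model scores are log probabilities, so they are negative and only their order matters.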

You can do your project in groups of up to 3 people, and you have to send the following elements packaged in a compressed file:

  • The full source code in Python (without the cache .pyc files to reduce size)
  • Minimal documentation giving the name of each member of the group and explaining how to use your system, which test collection it works on, which IR models it implements, etc.

Please do not include any data, in order to reduce the final file size. Produce short, minimalist code: the goal of this project is to better understand how an IR system really works.

Then send the result to jean-pierre.chevallet@univ-grenoble-alpes.fr before the official exam period (around the end of January 2022). Meanwhile, you can send technical questions to jean-pierre.chevallet@univ-grenoble-alpes.fr.

First session examination

The examination will be on February 2nd, 2022 from 9:45 to 11:45am in ENSIMAG Amphi E.
Course materials, the two papers related to the examinations, personal notes, and calculators (without network capabilities) are allowed.

You are expected to do research work on the two papers proposed below, so as to understand them and be able to comment on them. You will have to answer questions on topics covered in the lessons. You must also take time to read complementary information in order to understand the papers. Be sure to bring a copy of the two research papers with you, as they will NOT be redistributed with the examination subject. You may annotate them. The bibliography and appendices, if any, are part of the papers.

The papers for the 2021/2022 exam are:

Previous years examinations

2021-2022_page.txt · Last modified: 2022/12/12 09:16 by quenot
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Share Alike 4.0 International