====== GBX9MO23 - Information Access and Retrieval 2021-2022 ======

Available from Master 2 [[http://mosig.imag.fr/ProgramEn/M2S1|MOSIG]] and [[https://iam.imag.fr/m2tracks#data_science_ds|MSIAM]].

[[http://formations.univ-grenoble-alpes.fr/en/catalog/master-s-degree-XB/sciences-technologies-and-health-STS/master-in-computer-science-program-program1-master-informatique-en/master-of-science-in-informatics-at-grenoble-mosig-subprogram-subprogram-master-of-science-in-informatics-at-grenoble-mosig-en/ue-information-access-and-retrieval-IGDF9Z8M.html|Course Description]]

This course is given by [[http://mrim.imag.fr/User/jean-pierre.chevallet/|Jean-Pierre Chevallet]], [[http://lig-membres.imag.fr/mulhem/|Philippe Mulhem]], [[http://mrim.imag.fr/User/lorraine.goeuriot/|Lorraine Goeuriot]] and [[http://lig-membres.imag.fr/quenot/|Georges Quénot]] from the [[http://lig-mrim.imag.fr/|Multimedia Information Modeling and Retrieval]] (MRIM) research group of the [[https://www.liglab.fr/en|Grenoble Informatics Laboratory]] (LIG).

Contact: [[mailto:Georges.Quenot@imag.fr|Georges.Quenot@imag.fr]]

Contents / schedule:

Part I: Foundations of Information Retrieval (Jean-Pierre Chevallet and Philippe Mulhem)
  * Course 1: {{m2r_mosig_ira_chapter_01_information_retrieval_basics.pdf|Information retrieval basics}} (J.-P. Chevallet).
  * Course 2: {{m2r_mosig_iar_chapter_02_classical_models_for_information_retrieval.pdf|Classical models for information retrieval}} (J.-P. Chevallet).
  * Course 3: Vector space model and embeddings (J.-P. Chevallet).
  * Course 4: {{ :probabilistic_information_retrieval_2020.pdf |Probabilistic IR Models (version for 2020)}} (P. Mulhem). {{ ::exercices_lm_-_correction.pdf | 2 exercises with corrections}}

Part II: Web, social networks and health (Philippe Mulhem, Lorraine Goeuriot)
  * Course 5: {{ ::web_ir_2021.pdf |Web information retrieval (NEW: 2021 version)}} (P. Mulhem). {{ ::hits-pr.xlsx | Excel document with the computations of the HITS and PageRank values from the slides}}. {{ ::log_regression_ex.xlsx | Excel document with the logistic regression example for Learning to Rank from the slides}}.
  * Course 6: {{ ::evaluation_of_information_retrieval_-_2021.pdf | IR evaluation}} (2021 version, last updated on November 23rd, 2021) (P. Mulhem). {{ :computing_eval_examples.xlsx |Excel document with examples of Recall/precision tables and diagrams, and nDCG}}.
  * Course 7: Continuation of Course 6.
  * Course 8: {{::medical-ir.pdf|Medical information retrieval}} (L. Goeuriot).

Part III: Multimedia indexing and retrieval (Georges Quénot)
  * Mathematics Reminders: {{:mr1.pdf |Linear algebra and convolutions}} (G. Quénot).
  * Mathematics Reminders: {{:mr2.pdf |Differential calculus}} (G. Quénot).
  * Course 9: {{:m2-mosig-iar-9.pdf|Visual content representation and retrieval}} (only introduction and color / texture / points of interest descriptors) (G. Quénot).
  * Course 10: {{:m2-mosig-iar-10.pdf|Classical machine Learning for multimedia indexing}} (part on LSH is excluded) (G. Quénot).
  * Course 11: {{:m2-mosig-iar-11.pdf|Deep learning for multimedia indexing and retrieval - part 1}} (G. Quénot).
  * Course 12: {{:m2-mosig-iar-12.pdf|Deep learning for multimedia indexing and retrieval - part 2}} (G. Quénot).

References to IR books or papers
  * [[http://nlp.stanford.edu/IR-book/|Introduction to Information Retrieval, http://nlp.stanford.edu/IR-book/]]
  * [[https://ciir.cs.umass.edu/irbook/|Search Engines Information Retrieval in Practice, https://ciir.cs.umass.edu/irbook/]]

===== Practical =====

The goal of this practical is to implement in Python a skeleton of an IR system. The fundamental structure to implement is an inverted file that stores, for each term, the internal Id (integer) of each document containing it and the frequency (integer) of the term in that document.
You must accept the following implementation constraints to reduce the size of this inverted file:
  * Document Id stored as a 4-byte unsigned integer (32 bits, max about 4 billion indexed documents)
  * Frequency stored as a 1-byte unsigned integer (8 bits, max frequency of 255)

For that, use Python's basic and efficient [[https://docs.python.org/3/library/array.html|array]] structure. This is the only imposed data structure for this project. The inverted file can be loaded into memory for more efficient searching. Besides that, you have to store the external document Ids in a sequential structure that can be saved. Also, the dictionary should be programmed using a basic Python structure that can be saved.

Optionally, you can use the following libraries:
  * The [[https://www.nltk.org/api/nltk.corpus.html|nltk.corpus]] part of the Natural Language Toolkit for reading the corpus and for the stopwords
  * The [[https://www.nltk.org/api/nltk.stem.html|nltk.stem]] package for stemmers like [[https://www.nltk.org/api/nltk.stem.snowball.html|snowball]].

For the data set, you can use the following test collections (documents and solved queries):
  * The data resources available from [[http://ir.dcs.gla.ac.uk/resources/|Glasgow University IR Research Group]] like:
    * [[http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/|CACM]]
    * [[http://ir.dcs.gla.ac.uk/resources/test_collections/cran/|Cranfield]]
    * ...
  * Or other resources like [[http://www.daviddlewis.com/resources/testcollections/reuters21578/|reuters21578]]

The minimum matching models to implement are the Vector Space Model and one simple Language Model.

You can do your project in groups of up to 3 people, and you have to send the following elements packaged in a compressed file:
  * The full source code in Python (without the cached .pyc files, to reduce size)
  * A minimal documentation with the name of each member of the group and an explanation of how to use your system, which test collection it works on, which IR model it uses, etc.
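
As an illustration of the storage constraints above, here is a minimal sketch of an inverted file built on the ''array'' module. It is one possible layout, not a reference solution: the names ''InvertedFile'', ''add_document'', ''postings'', ''dictionary'' and ''external_ids'' are hypothetical. It also assumes that typecode ''I'' is 4 bytes wide on your platform (true on common platforms, but check ''array("I").itemsize''), and that frequencies above 255 are simply capped.

```python
from array import array
from collections import Counter


class InvertedFile:
    """Toy inverted file: per term, two parallel arrays (doc Ids, frequencies)."""

    MAX_FREQ = 255  # largest value an unsigned byte (typecode "B") can hold

    def __init__(self):
        # Dictionary: term -> (array of internal doc Ids, array of frequencies).
        self.dictionary = {}
        # Sequential structure: internal Id (list index) -> external Id.
        self.external_ids = []

    def add_document(self, external_id, tokens):
        """Index one tokenized document and return its internal Id."""
        doc_id = len(self.external_ids)
        self.external_ids.append(external_id)
        for term, freq in Counter(tokens).items():
            if term not in self.dictionary:
                # "I": unsigned int (assumed 4 bytes), "B": unsigned 1-byte int.
                self.dictionary[term] = (array("I"), array("B"))
            doc_ids, freqs = self.dictionary[term]
            doc_ids.append(doc_id)
            # Assumption of this sketch: cap the frequency so it fits in 1 byte.
            freqs.append(min(freq, self.MAX_FREQ))
        return doc_id

    def postings(self, term):
        """Return the (internal doc Id, frequency) pairs for a term."""
        doc_ids, freqs = self.dictionary.get(term, (array("I"), array("B")))
        return list(zip(doc_ids, freqs))


if __name__ == "__main__":
    index = InvertedFile()
    index.add_document("CACM-1", ["ir", "system", "ir"])
    index.add_document("CACM-2", ["system"])
    print(index.postings("ir"))      # [(0, 2)]
    print(index.postings("system"))  # [(0, 1), (1, 1)]
```

Keeping Ids and frequencies in two parallel arrays lets each one use its own fixed-width typecode. For saving, ''array'' objects provide ''tofile()'' and ''fromfile()''; how the dictionary and the external Ids are saved is left open here.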
Please do not include any data, to reduce the final file size. Produce minimal, short code: the goal of this project is to better understand how an IR system really works.

Then send the result to jean-pierre.chevallet@univ-grenoble-alpes.fr before the official exam period (around the end of January 2022). Meanwhile, you can send technical questions to jean-pierre.chevallet@univ-grenoble-alpes.fr.

===== First session examination =====

The examination will be on **February 2nd, 2022 from 9:45 to 11:45am** in ENSIMAG **Amphi E**. \\ Course materials, the two papers related to the examination, personal notes, and calculators (without network capabilities) are allowed.

You are expected to study the two papers proposed below so as to understand them and be able to comment on them. You will have to answer questions on topics covered in the lessons. You must also take time to read complementary information in order to understand the papers.

**Be sure to bring with you a copy of the two research papers as they will NOT be redistributed with the examination subject.** You may annotate them. The bibliography and appendices, if any, are part of the papers.

The papers for the 2021/2022 exam are:
  * B. Mansouri, R. Zanibbi, D. W. Oard, Learning to Rank for Mathematical Formula Retrieval, ACM SIGIR '21, https://www.cs.rit.edu/~rlaz/files/LTR_Formulas_SIGIR2021.pdf
  * Jiaxin Wu and Chong-Wah Ngo, Interpretable Embedding for Ad-Hoc Video Search, ACM Multimedia 2020, http://vireo.cs.cityu.edu.hk/papers/MM2020_dual_task_video_retrieval.pdf

==== Previous years examinations ====

2017-2018 examination: {{ :gbx9mo23-2017-2018-exam.pdf |}}, papers: \\ https://www.researchgate.net/publication/305081616_A_Simple_Enhancement_for_Ad-hoc_Information_Retrieval_via_Topic_Modelling, \\ [[http://www.tyr.unlu.edu.ar/tallerIR/2014/papers/novel-tfidf.pdf]], \\ [[http://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf]]

2018-2019 examination: {{ :gbx9mo23-2018-2019-exam.pdf |}}, papers: \\ [[https://danluu.com/bitfunnel-sigir.pdf]], \\ [[https://arxiv.org/pdf/1604.01325]].

2019-2020 examination: {{ :gbx9mo23-2019-2020-exam.pdf |}}, papers: \\ [[http://openaccess.thecvf.com/content_CVPR_2019/papers/Dong_Dual_Encoding_for_Zero-Example_Video_Retrieval_CVPR_2019_paper.pdf]] \\ [[https://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=1302]]

2020-2021 examination: {{ :gbx9mo23-2020-2021-exam-tmpkxtz.pdf |}}, papers: \\ [[https://arxiv.org/pdf/1707.05612]] \\ [[https://people.cs.umass.edu/~elm/papers/zamani.pdf]]

[[2021-2022 page]]