During my PhD, I worked on Structured Information Retrieval and Machine Learning. Ludovic Denoyer and I developed a Bayesian Network framework for Information Access. Ludovic focused on classification, while I focused on Information Retrieval.
In information retrieval, exploiting document structure can improve retrieval performance. This was shown in the context of Web information retrieval on a structured site.
We can also use structure for new tasks: with XML documents, we can return the whole document as the answer to a given query, but we can also return a single paragraph, a section, or even a bibliographic reference. The document is therefore no longer the only atomic unit for information retrieval. Bayesian Networks can be adapted to find the smallest information units relevant to a user's information need (Bayesian networks and INEX). The framework is fully described, together with experiments on the INEX corpus, in (Piwowarski & Gallinari, 2005).
This new retrieval paradigm requires changing the way systems are evaluated. In INEX (see below), a new assessment scale has been proposed along with new precision/recall metrics (Norbert Gövert's PRng metric, for instance). I proposed several metrics, the latest being the most expressive (a generalisation of precision-recall) and simple to compute (missing reference). I co-authored a review on metrics in 2009 (missing reference).
The new measure and the Bayesian Network framework are fully described in my PhD thesis. I also developed a new algebra (missing reference) for queries that mix content and structure constraints (e.g. "find a paragraph about XML in a book about information retrieval"). This algebra will be used for the next INEX initiative (December 2004).
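To make the idea of mixing content and structure constraints concrete, here is a minimal Python sketch of the example query above. It is an illustration only and not the algebra itself: the toy collection, the element names (`book`, `p`), the helper functions and the keyword-containment test standing in for "about" are all assumptions introduced for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy collection; the element names are assumptions for illustration.
COLLECTION = """
<library>
  <book>
    <title>Information Retrieval</title>
    <chapter>
      <p>Indexing XML documents requires handling nested elements.</p>
      <p>Evaluation relies on precision and recall.</p>
    </chapter>
  </book>
  <book>
    <title>Cooking Basics</title>
    <chapter>
      <p>Some recipes are exchanged as XML files.</p>
    </chapter>
  </book>
</library>
"""


def text_of(element):
    """Return all text contained in an element, lower-cased."""
    return " ".join(element.itertext()).lower()


def is_about(element, keywords):
    """Naive stand-in for relevance: every keyword occurs in the element text."""
    content = text_of(element)
    return all(keyword.lower() in content for keyword in keywords)


def find(root, outer_tag, outer_keywords, inner_tag, inner_keywords):
    """Return inner_tag elements matching inner_keywords that are located
    inside an outer_tag element matching outer_keywords."""
    results = []
    for outer in root.iter(outer_tag):
        if not is_about(outer, outer_keywords):
            continue  # structural context does not satisfy its content constraint
        for inner in outer.iter(inner_tag):
            if is_about(inner, inner_keywords):
                results.append(inner)
    return results


root = ET.fromstring(COLLECTION)
# "find a paragraph about XML in a book about information retrieval"
matches = find(root, "book", ["information retrieval"], "p", ["xml"])
paragraphs = [" ".join(m.itertext()).strip() for m in matches]
print(paragraphs)
```

Only the paragraph from the first book is returned: the paragraph in the second book also mentions XML, but its enclosing book does not satisfy the content constraint. A real system would replace the boolean keyword test with a graded relevance score.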
I participated in the INEX initiative in 2002 and 2003. The aim of this initiative is to provide means, in the form of a large testbed (test collection) and appropriate scoring methods, for evaluating the retrieval of XML documents. In 2003, I developed the interface that allows participants to judge the answers returned by the different participating systems. This interface is specific to structured documents, as it allows one to judge any XML element within a document. It provides some help during this process: keyword highlighting, an eye-friendly display of documents, and, most importantly, consistency and exhaustivity checks. The consistency check ensures that constraints between judged elements are respected. The exhaustivity check ensures that a good part of the document has been assessed.
A report on the assessment procedure and the x-rai interface can be found in (missing reference).
- Piwowarski, B., & Gallinari, P. (2005). A Bayesian Network for XML Information Retrieval: Searching and Learning with the INEX Collection. In Proceedings of the Initiative for the Evaluation of XML Retrieval (pp. 655–681). https://doi.org/10.1007/s10791-005-0751-6