4

SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

In neural Information Retrieval, ongoing research is directed towards improving the first retriever in ranking pipelines. Learning dense embeddings to conduct retrieval using efficient approximate nearest neighbors methods has proven to work well. …

To Beam Or Not To Beam: That is a Question of Cooperation for Language GANs

Due to the discrete nature of words, language GANs require to be optimized from rewards provided by discriminator networks, via reinforcement learning methods. This is a much harder setting than for continuous tasks, which enjoy gradient flows from …

ColdGANs: Taming Language GANs with Cautious Sampling Strategies

Training regimes based on Maximum Likelihood Estimation (MLE) suffer from known limitations, often leading to poorly generated text sequences. At the root of these limitations is the mismatch between training and inference, i.e. the so-called …

MLSUM: The Multilingual Summarization Corpus

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with …

Beyond Cumulated Gain and Average Precision: Including Willingness and Expectation in the User Model

In this paper, we define a new metric family based on two concepts: The definition of the stopping criterion and the notion of satisfaction, where the former depends on the willingness and expectation of a user exploring search results. Both concepts …

Diversification Based Static Index Pruning - Application to Temporal Collections

Nowadays, web archives preserve the history of large portions of the web. As medias are shifting from printed to digital editions, accessing these huge information sources is drawing increasingly more attention from national and international …

Efficient Document Indexing Using Pivot Tree

We present a novel method for efficiently searching top-k neighbors for documents represented in high dimensional space of terms based on the cosine similarity. Mostly, documents are stored as bag-of-words tf-idf representation. One of the most used …

Exploring a Multidimensional Representation of Documents and Queries (extended version)

In Information Retrieval (IR), whether implicitly or explicitly, queries and documents are often represented as vectors. However, it may be more beneficial to consider documents and/or queries as multidimensional objects. Our belief is this would …

Learning Term Weights for Ad-hoc Retrieval

Most Information Retrieval models compute the relevance score of a document for a given query by summing term weights specific to a document or a query. Heuristic approaches, like TF-IDF, or probabilistic models, like BM25, are used to specify how a …

Parameterized Neural Network Language Models for Information Retrieval

Information Retrieval (IR) models need to deal with two difficult issues, vocabulary mismatch and term dependencies. Vocabulary mismatch corresponds to the difficulty of retrieving relevant documents that do not contain exact query terms but …