Word Embeddings and Length Normalization for Document Ranking

Authors: Sannikumar Patel, Markus Hofmann, Kunjan Patel

POLIBITS, Vol. 60, pp. 57-65, 2019.

Abstract: Distributed word representation techniques have been effectively integrated into Information Retrieval tasks. The most basic approach maps a document and the query words into a vector space and computes the semantic similarity between them. However, this approach is biased with respect to document length: it ranks short documents higher than documents with a larger vocabulary. When a document is averaged into a single vector, each word contributes equally, which increases the distance between the query and document vectors. In this paper, we propose that document length normalization be applied to address this length bias in embedding-based ranking. We present an experiment applying traditional length normalization techniques to a word2vec (Skip-gram) model trained on the TREC Blog06 dataset for ad-hoc retrieval tasks. We also incorporate relevance signals by introducing a simple Linear Ranking (LR) function, which treats the presence of query words in a document as evidence of relevance during ranking. Our combined method of length normalization and LR increases Mean Average Precision by up to 47% over a simple embedding-based baseline.

Keywords: Word embeddings, neural information retrieval, distributed word representations, Word2Vec
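The pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy word vectors, pivoted length normalization (one traditional technique; the paper's exact choice may differ), and the equal weighting of the similarity and query-overlap terms in the LR function are all assumptions.

```python
import math

# Toy 2-d word vectors; the paper instead trains Skip-gram on TREC Blog06.
word_vecs = {
    "cat": [0.9, 0.1], "feline": [0.8, 0.2],
    "dog": [0.1, 0.9], "pet":    [0.5, 0.5],
}

def avg_vector(tokens):
    """Map a text into vector space by averaging its word vectors."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def score(query, doc, avg_doc_len, slope=0.2):
    """Embedding similarity with pivoted length normalization, plus a
    simple linear relevance signal (fraction of query terms present)."""
    sim = cosine(avg_vector(query), avg_vector(doc))
    # Pivoted normalization penalizes documents longer than average.
    pivot = (1 - slope) + slope * len(doc) / avg_doc_len
    overlap = sum(t in doc for t in query) / len(query)
    return sim / pivot + overlap  # equal weights are an assumption

docs = [["cat", "feline", "pet"], ["dog", "pet"]]
avg_len = sum(len(d) for d in docs) / len(docs)
query = ["cat"]
ranked = sorted(docs, key=lambda d: score(query, d, avg_len), reverse=True)
```

Without the overlap term and the length penalty, the score reduces to the plain averaged-embedding baseline the paper compares against.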

PDF: Word Embeddings and Length Normalization for Document Ranking


