TrainQA: a Training Corpus for Corpus-Based Question Answering Systems

Authors:  David Tomás, José L. Vicedo, Empar Bisbal, Lidia Moreno

Polibits, 40, pp. 5-12, 2009.

Abstract: This paper describes the development of an English corpus of factoid TREC-like question-answer pairs. The resulting corpus consists of more than 70,000 samples, each containing the following information: a question, its question type, an exact answer to the question, the different context levels (sentence, paragraph and document) in which the answer occurs inside a document, and a label indicating whether the answer is correct (a positive sample) or not (a negative sample). For instance, TrainQA can be used to train a binary classifier that decides whether a given answer to a question is correct (positive) or not (negative). To our knowledge, this is the first corpus aimed at training every stage of a trainable Question Answering system: question classification, information retrieval, answer extraction and answer validation.
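The per-sample information listed in the abstract can be pictured as a simple record. The following is a minimal Python sketch, not the corpus's actual file format: the class name `TrainQASample`, its field names, the helper `to_classifier_pair`, the `[SEP]` separator and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainQASample:
    """Hypothetical record mirroring the fields named in the abstract."""
    question: str
    question_type: str
    answer: str        # exact answer string
    sentence: str      # sentence-level context containing the answer
    paragraph: str     # paragraph-level context
    document: str      # document-level context
    is_correct: bool   # True = positive sample, False = negative sample


def to_classifier_pair(sample: TrainQASample) -> Tuple[str, int]:
    """Turn a sample into (input text, binary label) for answer validation.

    The concatenation scheme is an assumption, not the authors' method.
    """
    text = f"{sample.question} [SEP] {sample.answer} [SEP] {sample.sentence}"
    return text, int(sample.is_correct)


# Invented example values; they are not taken from the corpus.
positive = TrainQASample(
    question="Who wrote Hamlet?",
    question_type="PERSON",
    answer="William Shakespeare",
    sentence="Hamlet was written by William Shakespeare.",
    paragraph="Hamlet was written by William Shakespeare. It is a tragedy.",
    document="(full document text)",
    is_correct=True,
)
negative = TrainQASample(
    question="Who wrote Hamlet?",
    question_type="PERSON",
    answer="Denmark",
    sentence="The play is set in Denmark.",
    paragraph="The play is set in Denmark. It is a tragedy.",
    document="(full document text)",
    is_correct=False,
)

pairs = [to_classifier_pair(s) for s in (positive, negative)]
```

Under these assumptions, each pair could be fed to any binary text classifier, while the `question_type` field would serve as the target for question classification and the three context fields as targets for retrieval and extraction.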

Keywords: Question answering; corpus-based systems
