SPOKENSQUAD

Spoken-SQuAD is a spoken question answering dataset built on top of the SQuAD dataset and released under the CC-BY-SA-4.0 license. In Spoken-SQuAD, the document is in spoken form, the question is given as text, and the answer to each question is always a span in the document. The spoken documents were generated from the SQuAD textual articles using a Google text-to-speech system, and the corresponding automatic transcripts were generated using CMU Sphinx. The questions were left in text form. The SQuAD training set was used to generate the Spoken-SQuAD training set, and the SQuAD development set was used to generate the Spoken-SQuAD testing set. All question-answer pairs whose answer did not appear in the ASR transcription of the associated article were removed. This dataset contains the dev split.
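The filtering step described above (dropping question-answer pairs whose answer is missing from the ASR transcript) can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the field names `article_id` and `answer`, the helper names, and the case-insensitive, whitespace-normalized matching are all assumptions made for the example.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def answer_in_transcript(answer: str, transcript: str) -> bool:
    """Return True if the non-empty answer occurs as a substring of the transcript."""
    norm_answer = normalize(answer)
    return bool(norm_answer) and norm_answer in normalize(transcript)


def filter_qa_pairs(qa_pairs, transcripts):
    """Keep only QA pairs whose answer appears in the ASR transcript of their article.

    qa_pairs:    list of dicts with hypothetical keys "article_id" and "answer"
    transcripts: dict mapping article_id to the ASR transcript text
    """
    kept = []
    for qa in qa_pairs:
        transcript = transcripts.get(qa["article_id"], "")
        if answer_in_transcript(qa["answer"], transcript):
            kept.append(qa)
    return kept


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    transcripts = {"a1": "the normans gave their name to normandy"}
    qa_pairs = [
        {"article_id": "a1", "answer": "Normandy"},  # survives ASR, kept
        {"article_id": "a1", "answer": "France"},    # absent from transcript, dropped
    ]
    print(filter_qa_pairs(qa_pairs, transcripts))
```

Exact substring matching is the simplest reading of "the answer did not exist in the ASR transcriptions"; a stricter or looser match (for example, token-level alignment) would change which pairs are retained.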

Chia-Hsuan Li, Szu-Lin Wu, Chi-Liang Liu, Hung-yi Lee, 2018, Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension, Proceedings of Interspeech 2018, Hyderabad, India

Speech Question-Answering

Language en