Automatic Speech Recognition

Automatic Speech Recognition (ASR) translates spoken language into text, letting users interact with devices hands-free. ASR models analyze audio, convert it into phonetic sequences, and map those sequences to words using linguistic models. Early systems relied on Hidden Markov Models and Gaussian Mixture Models; modern ASR employs deep learning architectures, such as recurrent neural networks (RNNs) and transformers, which deliver higher accuracy in real-world environments. ASR systems still face challenges such as background noise, speaker variability, and diverse accents, and advances in noise reduction and speaker adaptation are making them more robust. ASR is widely used in virtual assistants, transcription tools, call centers, and accessibility solutions, including automatic captioning for videos and events. Ongoing research explores multilingual ASR and end-to-end systems that process raw audio directly.
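
As an illustration of end-to-end inference in practice, the following minimal Python sketch transcribes an audio file with a pretrained model via the Hugging Face transformers library; the model checkpoint and file name are assumptions for illustration, not part of this catalog.

    # Minimal ASR inference sketch (assumed: transformers installed,
    # "example.wav" is a local speech recording).
    from transformers import pipeline

    # Load a pretrained end-to-end speech recognition model.
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

    # Transcribe the audio file; the pipeline handles decoding and resampling.
    result = asr("example.wav")
    print(result["text"])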

ACL 60/60 test set

A collection of ACL 2022 paper presentations for which pre-recorded audio or video was provided to the ACL Anthology. The presentations cover a variety of native and non-native English accents and have been professionally transcribed and translated into ten languages, including 4 European languages (German, Portuguese, Dutch, and French). The dataset is described in detail in the publication below.

Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, and Jan Niehues. 2023. Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 62–78, Toronto, Canada. Association for Computational Linguistics.
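
Test sets such as this one are typically scored with word error rate (WER). A hedged sketch, assuming the jiwer package and hypothetical one-segment-per-line reference and hypothesis files:

    # WER scoring sketch (file names are placeholders, not part of the dataset).
    import jiwer

    with open("acl6060.ref.en.txt") as f:
        references = [line.strip() for line in f]
    with open("acl6060.hyp.en.txt") as f:
        hypotheses = [line.strip() for line in f]

    # jiwer.wer accepts parallel lists of reference/hypothesis strings.
    print(f"WER: {jiwer.wer(references, hypotheses):.3f}")

Note that the paper above also evaluates under realistic conditions in which system output must first be resegmented to match the reference segmentation.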

Automatic Speech Recognition

Language: en

MuST-C test set

MuST-C is a large and freely available Multilingual Speech Translation Corpus built from English TED Talks. Its distinguishing features include: i) language coverage and diversity (from English into 14 languages from different families), ii) size (at least 237 hours of transcribed recordings per language, 430 on average), iii) variety of topics and speakers, and iv) data quality. The audio recordings from English TED Talks are automatically aligned at the sentence level with their manual transcriptions and translations. The MuST-C corpus is available for download for research purposes under a Creative Commons Attribution 4.0 International License. This test set is the English side of the MuST-C v1.3 en-de tst-COMMON split.

Roldano Cattoni, Mattia Antonino Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2021. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, Volume 66, March 2021.
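
To illustrate the sentence-level alignment, here is a hedged Python sketch that cuts one aligned sentence out of its talk-level recording; it assumes the MuST-C v1.x directory layout (a YAML file of segments with offset, duration, and source wav alongside parallel text files) and the PyYAML and soundfile packages:

    # MuST-C segment extraction sketch (paths assume the v1.x layout).
    import yaml
    import soundfile as sf

    split = "en-de/data/tst-COMMON"
    with open(f"{split}/txt/tst-COMMON.yaml") as f:
        segments = yaml.safe_load(f)  # list of {offset, duration, wav, ...}
    with open(f"{split}/txt/tst-COMMON.en") as f:
        transcripts = [line.strip() for line in f]

    # Slice the first aligned sentence out of its source recording.
    seg, text = segments[0], transcripts[0]
    audio, sr = sf.read(f"{split}/wav/{seg['wav']}")
    start = int(seg["offset"] * sr)
    end = int((seg["offset"] + seg["duration"]) * sr)
    print(text, audio[start:end].shape)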

Automatic Speech Recognition

Language: en

mTEDx test set

The corpus comprises audio recordings and transcripts from TEDx talks in 8 languages, including 6 European languages (Spanish, French, Portuguese, Italian, Greek, and German), with translations into up to 5 European languages (English, Spanish, French, Portuguese, and Italian). The audio recordings are automatically aligned at the sentence level with their manual transcripts and translations. The mTEDx dataset is available for download for research purposes under a Creative Commons Attribution 4.0 International License.

Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, and Matt Post. 2021. The Multilingual TEDx Corpus for Speech Recognition and Translation. In Proceedings of Interspeech 2021, Brno, Czech Republic.
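
Since this test set spans several languages, results are usually reported per language. A brief sketch in the same style as the WER example above, with hypothetical file names:

    # Per-language WER scoring sketch (file names are placeholders).
    import jiwer

    for lang in ["es", "fr", "it", "pt"]:
        with open(f"mtedx.ref.{lang}.txt") as f:
            refs = [line.strip() for line in f]
        with open(f"mtedx.hyp.{lang}.txt") as f:
            hyps = [line.strip() for line in f]
        print(f"{lang}: WER = {jiwer.wer(refs, hyps):.3f}")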

Automatic Speech Recognition

Language: es, fr, it, pt

DiPCo test set

DiPCo (released under the CDLA Permissive 2.0 license, https://cdla.dev/) is a speech corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table.

Maarten Van Segbroeck, Zaid Ahmed, Ksenia Kutsenko, Cirenia Huerta, Tinh Nguyen, Björn Hoffmeister, Jan Trmal, Maurizio Omologo, and Roland Maas. 2020. DiPCo - Dinner Party Corpus. In Proceedings of Interspeech 2020, Shanghai, China.
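
Because the dinner-party recordings are far-field and multi-channel, a common first step before running a single-channel ASR model is channel selection or a simple downmix. A hedged sketch with a hypothetical file name (beamforming would be a stronger baseline):

    # Naive multi-channel downmix sketch (assumed: soundfile installed).
    import soundfile as sf

    audio, sr = sf.read("dipco_session.wav")  # shape: (samples, channels)
    mono = audio.mean(axis=1) if audio.ndim == 2 else audio
    sf.write("dipco_session_mono.wav", mono, sr)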

Automatic Speech Recognition

Language: en