Machine translation
Machine translation (MT) is the automatic translation of text from one language into another, with the goal of producing accurate and contextually appropriate output. Neural MT systems, such as Transformer-based models, have largely replaced earlier statistical and rule-based approaches because they produce more fluent translations. Neural MT can also be adapted to specialized domains such as legal or medical content, where precise terminology matters. Key challenges include handling idiomatic expressions, ambiguous terms, and cultural nuances while preserving the style of the original text. Current MT models support low-resource languages and multilingual translation, including zero-shot translation, in which a model translates between language pairs it never saw paired during training. These systems have become essential for global communication, e-commerce, and customer support. Research continues to focus on real-time applications, better handling of domain-specific terms, and multimodal translation that considers both text and visual data.
ACL6060 test set
Collection of ACL 2022 paper presentations for which pre-recorded audio or
video presentations were provided to the ACL Anthology.
Presentations include a variety of native and non-native English accents.
Presentations have been professionally transcribed and translated into ten
target languages, including four European languages (German, Portuguese,
Dutch, and French). The dataset is described in detail in Salesky et al.
(2023).
Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, and
Jan Niehues, 2023, Evaluating Multilingual Speech Translation under
Realistic Conditions with Resegmentation and Terminology, in Proceedings of
the 20th International Conference on Spoken Language Translation
(IWSLT 2023), pages 62-78, Toronto, Canada, Association for Computational
Linguistics.
FLORES test set
FLORES+ is a multilingual machine translation benchmark released under CC
BY-SA 4.0. This dataset was originally released by FAIR researchers at Meta
under the name FLORES. The + has been added to the name to disambiguate
between the original datasets and this new actively developed version.
The data consists of translations primarily from English into around 200
language varieties. The original English sentences were sampled in equal
amounts from Wikinews (an international news source), Wikijunior (a
collection of age-appropriate non-fiction books), and Wikivoyage (a travel
guide).
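The equal-amounts sampling described above can be illustrated with a minimal sketch. This is not the procedure used to build FLORES+; the function name `sample_equally`, the toy sentence pools, and the per-source count are all hypothetical and serve only to show what drawing the same number of sentences from each source looks like.

```python
import random


def sample_equally(sources, per_source, seed=0):
    """Draw the same number of sentences from every source pool.

    `sources` maps a source name to a list of sentences. Both the name
    and the layout are illustrative assumptions, not the FLORES+ format.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = []
    for name, sentences in sources.items():
        if len(sentences) < per_source:
            raise ValueError(f"{name} has fewer than {per_source} sentences")
        # Sample without replacement within each source.
        for sentence in rng.sample(sentences, per_source):
            sampled.append((name, sentence))
    return sampled


# Toy placeholder pools of different sizes; not real FLORES+ data.
toy_sources = {
    "wikinews": [f"news sentence {i}" for i in range(10)],
    "wikijunior": [f"junior sentence {i}" for i in range(8)],
    "wikivoyage": [f"voyage sentence {i}" for i in range(12)],
}

balanced = sample_equally(toy_sources, per_source=5)
```

Each source contributes exactly `per_source` sentences regardless of how large its pool is, so no single domain dominates the resulting set.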
The devtest split is used for each of the eight language pairs.
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad,
Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht,
Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi
Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John
Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon
Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale,
Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán,
Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem,
Holger Schwenk, and Jeff Wang, 2024, Scaling neural machine translation to
200 languages, Nature 630, pages 841-846.