Machine translation

Machine translation (MT) is the automatic translation of text from one language into another, with the aim of producing accurate and contextually appropriate output. Neural MT systems, typically built on transformer models, have largely replaced earlier statistical and rule-based approaches because they produce more fluent translations. Neural MT can also be adapted to specialized fields such as legal or medical content, where precise terminology matters. A key challenge lies in handling idiomatic expressions, ambiguous terms, and cultural nuances while preserving the style of the original text. Current MT models support low-resource languages and multilingual translation, including zero-shot translation, where a model translates between language pairs it did not see paired during training. These systems have become essential for global communication, e-commerce, and customer support. MT continues to evolve, with research focusing on real-time applications, better handling of domain-specific terms, and multimodal translation that considers both text and visual data.

FLORES test set

FLORES+ is a multilingual machine translation benchmark released under CC BY-SA 4.0. This dataset was originally released by FAIR researchers at Meta under the name FLORES. The + has been added to the name to disambiguate between the original datasets and this new actively developed version. The data consists of translations primarily from English into around 200 language varieties. The original English sentences were sampled in equal amounts from Wikinews (an international news source), Wikijunior (a collection of age-appropriate non-fiction books), and Wikivoyage (a travel guide).

The devtest split was used for each of the eight language pairs.
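As a rough illustration, the eight pairs can be enumerated as in the sketch below. It assumes that English is paired with each of the other eight languages listed for this test set (cs, de, es, fr, it, nl, pt, ro) and that English is the source side; the direction is an assumption, and the helper name `language_pairs` and the constants are hypothetical, not part of any FLORES+ tooling.

```python
# Hypothetical sketch: enumerate the evaluation pairs for the devtest split.
# Assumption: English is the source side of every pair.
SOURCE = "en"
TARGETS = ["cs", "de", "es", "fr", "it", "nl", "pt", "ro"]
SPLIT = "devtest"


def language_pairs(source, targets):
    """Return one (source, target) tuple per target language."""
    return [(source, target) for target in targets]


pairs = language_pairs(SOURCE, TARGETS)

for src, tgt in pairs:
    # One evaluation run per pair, always on the same split.
    print(f"{SPLIT}: {src}-{tgt}")
```

If the evaluation instead runs into English, swapping the tuple order in `language_pairs` is the only change needed.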

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang, 2024, Scaling neural machine translation to 200 languages, Nature 630, 841–846.

Machine translation

Language	en	cs	de	es	fr	it	nl	pt	ro

ACL6060 test set

ACL6060 is a collection of ACL 2022 paper presentations for which pre-recorded audio or video was provided to the ACL Anthology. The presentations cover a variety of native and non-native English accents. They have been professionally transcribed and translated into ten target languages, including four European languages (German, Portuguese, Dutch, and French). The dataset is described in detail in Salesky et al. (2023).

Elizabeth Salesky, Kareem Darwish, Mohamed Al-Badrashiny, Mona Diab, Jan Niehues, 2023, Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology, in Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023), pages 62–78, Toronto, Canada, Association for Computational Linguistics.

Machine translation

Language	en	de	fr	nl	pt