FLORES

FLORES+ is a multilingual machine translation benchmark released under CC BY-SA 4.0. This dataset was originally released by FAIR researchers at Meta under the name FLORES. The + has been added to the name to disambiguate between the original datasets and this new actively developed version. The data consists of translations primarily from English into around 200 language varieties. The original English sentences were sampled in equal amounts from Wikinews (an international news source), Wikijunior (a collection of age-appropriate non-fiction books), and Wikivoyage (a travel guide).

The devtest split is used for each of the eight language pairs.

NLLB Team, Marta R. Costa-jussa, James Cross, Onur Celebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzman, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang. Scaling neural machine translation to 200 languages. Nature 630, 841--846 (2024).

Machine translation

Language en, cs, de, es, fr, it, nl, pt, ro
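
As a rough illustration of how the eight language pairs could be enumerated from the language codes listed above, the following minimal sketch pairs English with each of the other eight languages. The pairing direction (English listed first) and the names `LANGUAGES`, `SPLIT`, and `language_pairs` are assumptions made for this example; the description above only states that there are eight pairs and that the devtest split is used.

```python
# Language codes as listed in this card.
LANGUAGES = ["en", "cs", "de", "es", "fr", "it", "nl", "pt", "ro"]

# Split named in the description above.
SPLIT = "devtest"


def language_pairs(languages, pivot="en"):
    """Pair the pivot language with every other language in the list.

    Assumption: English is the shared side of every pair. The card does
    not state the translation direction.
    """
    return [(pivot, lang) for lang in languages if lang != pivot]


pairs = language_pairs(LANGUAGES)

# Print one line per pair together with the split it is drawn from.
for first, second in pairs:
    print(f"{first}-{second} ({SPLIT})")
```

Nine language codes with English as the common side give exactly eight pairs, which matches the count stated in the description.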