TY - GEN
T1 - Peru is Multilingual, Its Machine Translation Should Be Too?
AU - Oncevay, Arturo
N1 - Publisher Copyright: © 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
N2 - Peru is a multilingual country with a long history of contact between the indigenous languages and Spanish. Taking advantage of this context for machine translation is possible with multilingual approaches for learning both unsupervised subword segmentation and neural machine translation models. The study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo, providing both many-to-Spanish and Spanish-to-many models and outperforming pairwise baselines in most of them. The task exploited a large English-Spanish dataset for pretraining, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we also assessed the out-of-domain capabilities in two evaluation datasets for Quechua and a new one for Shipibo-Konibo.
AB - Peru is a multilingual country with a long history of contact between the indigenous languages and Spanish. Taking advantage of this context for machine translation is possible with multilingual approaches for learning both unsupervised subword segmentation and neural machine translation models. The study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo, providing both many-to-Spanish and Spanish-to-many models and outperforming pairwise baselines in most of them. The task exploited a large English-Spanish dataset for pretraining, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we also assessed the out-of-domain capabilities in two evaluation datasets for Quechua and a new one for Shipibo-Konibo.
UR - http://www.scopus.com/inward/record.url?scp=85123956688&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85123956688
T3 - Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
SP - 194
EP - 201
BT - Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
A2 - Mager, Manuel
A2 - Oncevay, Arturo
A2 - Rios, Annette
A2 - Ruiz, Ivan Vladimir Meza
A2 - Palmer, Alexis
A2 - Neubig, Graham
A2 - Kann, Katharina
PB - Association for Computational Linguistics (ACL)
T2 - 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
Y2 - 11 June 2021
ER -