
Peru is Multilingual, Its Machine Translation Should Be Too?

  • Arturo Oncevay
  • ILCC University of Edinburgh

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

12 Scopus citations

Abstract

Peru is a multilingual country with a long history of contact between the indigenous languages and Spanish. Taking advantage of this context for machine translation is possible with multilingual approaches for learning both unsupervised subword segmentation and neural machine translation models. The study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo, providing both many-to-Spanish and Spanish-to-many models and outperforming pairwise baselines in most of them. The task exploited a large English-Spanish dataset for pretraining, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we also assessed the out-of-domain capabilities in two evaluation datasets for Quechua and a new one for Shipibo-Konibo.

Original language: English
Title of host publication: Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
Editors: Manuel Mager, Arturo Oncevay, Annette Rios, Ivan Vladimir Meza Ruiz, Alexis Palmer, Graham Neubig, Katharina Kann
Publisher: Association for Computational Linguistics (ACL)
Pages: 194-201
Number of pages: 8
ISBN (Electronic): 9781954085442
State: Published - 2021
Externally published: Yes
Event: 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021 - Virtual, Online
Duration: 11 Jun 2021 → …

Publication series

Name: Proceedings of the 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021

Conference

Conference: 1st Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2021
City: Virtual, Online
Period: 11/06/21 → …
