Awajun-OP: Multi-domain Dataset for Spanish–Awajun Machine Translation

Oscar Moreno Veliz, Yanua Liseth Atamain Uwarai, Arturo Oncevay

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

We introduce a Spanish-Awajun parallel dataset of 22k high-quality sentence pairs, built with the help of the journalistic organization Ojo Público. The dataset consists of parallel data obtained from various web sources, covering poems, stories, laws, protocols, guidelines, handbooks, the Bible, and news published by Ojo Público. We also analyze the dataset’s performance for Spanish-Awajun translation using a Transformer architecture with transfer learning from a parent model, with Spanish-English and Spanish-Finnish as the high-resource language pairs. As far as we know, this is the first Spanish-Awajun machine translation study, and we hope that this work will serve as a starting point for future research on this neglected Peruvian language. The dataset is available at https://github.com/iapucp/Awajun-OP.

Original language: English
Title of host publication: AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
Editors: Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Publisher: Association for Computational Linguistics (ACL)
Pages: 112-120
Number of pages: 9
ISBN (Electronic): 9798891761087
State: Published - 2024
Event: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024 - Mexico City, Mexico
Duration: 21 Jun 2024 → …

Publication series

Name: AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop

Conference

Conference: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024
Country/Territory: Mexico
City: Mexico City
Period: 21/06/24 → …
