Awajun-OP: Multi-domain Dataset for Spanish–Awajun Machine Translation

Oscar Moreno Veliz, Yanua Liseth Atamain Uwarai, Arturo Oncevay

Research output: Chapter in book/report/conference proceeding; Conference contribution; Peer-reviewed

1 Citation (Scopus)

Abstract

We introduce a Spanish-Awajun parallel dataset of 22k high-quality sentence pairs, built with the help of the journalistic organization Ojo Público. The dataset consists of parallel data obtained from various web sources, including poems, stories, laws, protocols, guidelines, handbooks, the Bible, and news published by Ojo Público. The study also analyzes the dataset’s performance for Spanish-Awajun translation using a Transformer architecture with transfer learning from a parent model, with Spanish-English and Spanish-Finnish as the high-resource language pairs. As far as we know, this is the first Spanish-Awajun machine translation study, and we hope that this work will serve as a starting point for future research on this neglected Peruvian language. The dataset is released at the following URL: https://github.com/iapucp/Awajun-OP.

Original language: English
Title of host publication: AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
Editors: Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Publisher: Association for Computational Linguistics (ACL)
Pages: 112-120
Number of pages: 9
ISBN (electronic): 9798891761087
DOI
Status: Published - 2024
Event: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024 - Mexico City, Mexico
Duration: 21 Jun 2024 → …

Publication series

Name: AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop

Conference

Conference: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024
Country/Territory: Mexico
City: Mexico City
Period: 21/06/24 → …
