TY - GEN
T1 - Awajun-OP
T2 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024
AU - Veliz, Oscar Moreno
AU - Uwarai, Yanua Liseth Atamain
AU - Oncevay, Arturo
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - We introduce a Spanish-Awajun parallel dataset of 22k high-quality sentence pairs, built with the help of the journalistic organization Ojo Público. The dataset consists of parallel data obtained from various web sources, such as poems, stories, laws, protocols, guidelines, handbooks, the Bible, and news published by Ojo Público. The study also includes an analysis of the dataset’s performance for Spanish-Awajun translation using a Transformer architecture with transfer learning from a parent model, utilizing Spanish-English and Spanish-Finnish as high-resource language pairs. As far as we know, this is the first Spanish-Awajun machine translation study, and we hope that this work will serve as a starting point for future research on this neglected Peruvian language. The dataset is available at the following URL: https://github.com/iapucp/Awajun-OP.
AB - We introduce a Spanish-Awajun parallel dataset of 22k high-quality sentence pairs, built with the help of the journalistic organization Ojo Público. The dataset consists of parallel data obtained from various web sources, such as poems, stories, laws, protocols, guidelines, handbooks, the Bible, and news published by Ojo Público. The study also includes an analysis of the dataset’s performance for Spanish-Awajun translation using a Transformer architecture with transfer learning from a parent model, utilizing Spanish-English and Spanish-Finnish as high-resource language pairs. As far as we know, this is the first Spanish-Awajun machine translation study, and we hope that this work will serve as a starting point for future research on this neglected Peruvian language. The dataset is available at the following URL: https://github.com/iapucp/Awajun-OP.
UR - http://www.scopus.com/inward/record.url?scp=85216927851&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.americasnlp-1.12
DO - 10.18653/v1/2024.americasnlp-1.12
M3 - Conference contribution
AN - SCOPUS:85216927851
T3 - AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
SP - 112
EP - 120
BT - AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
A2 - Mager, Manuel
A2 - Ebrahimi, Abteen
A2 - Rijhwani, Shruti
A2 - Oncevay, Arturo
A2 - Chiruzzo, Luis
A2 - Pugh, Robert
A2 - von der Wense, Katharina
PB - Association for Computational Linguistics (ACL)
Y2 - 21 June 2024
ER -