TY - GEN
T1 - Unlocking Knowledge with OCR-Driven Document Digitization for Peruvian Indigenous Languages
AU - Sánchez, Shadya
AU - Zariquiey, Roberto
AU - Oncevay, Arturo
N1 - Publisher Copyright: © 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - The current focus on resource-rich languages poses a challenge to linguistic diversity, affecting minority languages with limited digital presence and relatively old published and unpublished resources. In addressing this issue, this study targets the digitalization of old scanned textbooks written in four Peruvian indigenous languages (Asháninka, Shipibo-Konibo, Yanesha, and Yine) using Optical Character Recognition (OCR) technology. This is complemented with text correction methods to minimize extraction errors. Contributions include the creation of an annotated dataset with 454 scanned page images, for a rigorous evaluation, and the development of a module to correct OCR-generated transcription alignments.
AB - The current focus on resource-rich languages poses a challenge to linguistic diversity, affecting minority languages with limited digital presence and relatively old published and unpublished resources. In addressing this issue, this study targets the digitalization of old scanned textbooks written in four Peruvian indigenous languages (Asháninka, Shipibo-Konibo, Yanesha, and Yine) using Optical Character Recognition (OCR) technology. This is complemented with text correction methods to minimize extraction errors. Contributions include the creation of an annotated dataset with 454 scanned page images, for a rigorous evaluation, and the development of a module to correct OCR-generated transcription alignments.
UR - http://www.scopus.com/inward/record.url?scp=85216926072&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.americasnlp-1.11
DO - 10.18653/v1/2024.americasnlp-1.11
M3 - Conference contribution
AN - SCOPUS:85216926072
T3 - AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
SP - 103
EP - 111
BT - AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
A2 - Mager, Manuel
A2 - Ebrahimi, Abteen
A2 - Rijhwani, Shruti
A2 - Oncevay, Arturo
A2 - Chiruzzo, Luis
A2 - Pugh, Robert
A2 - von der Wense, Katharina
PB - Association for Computational Linguistics (ACL)
T2 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024
Y2 - 21 June 2024
ER -