Unlocking Knowledge with OCR-Driven Document Digitization for Peruvian Indigenous Languages

Shadya Sánchez, Roberto Zariquiey, Arturo Oncevay

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

1 Citation (Scopus)

Abstract

The current focus on resource-rich languages poses a challenge to linguistic diversity, affecting minority languages with limited digital presence and relatively old published and unpublished resources. To address this issue, this study targets the digitization of old scanned textbooks written in four Peruvian indigenous languages (Asháninka, Shipibo-Konibo, Yanesha, and Yine) using Optical Character Recognition (OCR) technology, complemented with text correction methods to minimize extraction errors. Contributions include the creation of an annotated dataset of 454 scanned page images for rigorous evaluation and the development of a module to correct OCR-generated transcription alignments.

Original language: English
Title of host publication: AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop
Editors: Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Publisher: Association for Computational Linguistics (ACL)
Pages: 103-111
Number of pages: 9
ISBN (electronic): 9798891761087
DOI
Status: Published - 2024
Event: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024 - Mexico City, Mexico
Duration: 21 Jun 2024 → …

Publication series

Name: AmericasNLP 2024 - 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas - Proceedings of the Workshop

Conference

Conference: 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas, AmericasNLP 2024
Country/Territory: Mexico
City: Mexico City
Period: 21/06/24 → …
