SchAman: Spell-Checking Resources and Benchmark for Endangered Languages from Amazonia

  • Arturo Oncevay
  • Gerardo Cardoso
  • Carlo Alva
  • César Lara Ávila
  • Jovita Vásquez Balarezo
  • Saúl Escobar Rodríguez
  • Delio Siticonatzi Camaiteri
  • Esaú Zumaeta Rojas
  • Didier L. Francis
  • Juan L. Bautista
  • Nimia Acho Rios
  • Remigio Zapata Cesareo
  • Héctor Erasmo Gómez Montoya
  • Roberto Zariquiey

Research output: Chapter in book/report/conference proceedings; Conference contribution; peer-reviewed

3 Citations (Scopus)

Abstract

Spell-checkers are core applications in language learning and normalisation, and they can contribute enormously to language revitalisation and language teaching in indigenous communities. Spell-checking as a generation task, however, requires a large amount of data, which is not feasible for endangered languages such as those spoken in Peruvian Amazonia. We propose augmentation methods for various misspelling types as a strategy to train neural spell-checking models, and we create an evaluation resource for four indigenous languages of Peru: Shipibo-Konibo, Asháninka, Yánesha and Yine. We focus on special errors that are significant for learning these languages, such as phoneme-to-grapheme ambiguity, grammatical errors (gender, tense, number, among others), accentuation, punctuation, and normalisation in contexts where two or more writing traditions co-exist. We found that an ensemble model, trained with augmented data from various types of error, achieves better overall scores in most of the error types and languages. Finally, we released our spell-checkers as a web service to be used by indigenous communities and organisations to develop future language materials.

Original language: English
Title of host publication: Student Research Workshop
Editors: Yan Hanqi, Yang Zonghan, Sebastian Ruder, Wan Xiaojun
Publisher: Association for Computational Linguistics (ACL)
Pages: 511-517
Number of pages: 7
ISBN (electronic): 9781955917568
DOI
Status: Published - 2022
Event: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL-IJCNLP 2022 - Virtual, Online
Duration: 20 Nov 2022 to 23 Nov 2022

Publication series

Name: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Long Paper, AACL-IJCNLP 2022
Volume: 3

Conference

Conference: 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL-IJCNLP 2022
City: Virtual, Online
Period: 20/11/22 to 23/11/22
