TY - GEN
T1 - SchAman
T2 - 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL-IJCNLP 2022
AU - Oncevay, Arturo
AU - Cardoso, Gerardo
AU - Alva, Carlo
AU - Ávila, César Lara
AU - Balarezo, Jovita Vásquez
AU - Rodríguez, Saúl Escobar
AU - Camaiteri, Delio Siticonatzi
AU - Rojas, Esaú Zumaeta
AU - Francis, Didier L.
AU - Bautista, Juan L.
AU - Rios, Nimia Acho
AU - Cesareo, Remigio Zapata
AU - Montoya, Héctor Erasmo Gómez
AU - Zariquiey, Roberto
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - Spell-checkers are core applications in language learning and normalisation, and they can contribute substantially to language revitalisation and language teaching in indigenous communities. Spell-checking as a generation task, however, requires a large amount of data, which is not feasible to obtain for endangered languages such as those spoken in Peruvian Amazonia. We propose augmentation methods for various misspelling types as a strategy to train neural spell-checking models, and we create an evaluation resource for four indigenous languages of Peru: Shipibo-Konibo, Asháninka, Yánesha, and Yine. We focus on errors that are significant for learning these languages, such as phoneme-to-grapheme ambiguity, grammatical errors (gender, tense, number, among others), accentuation, punctuation, and normalisation in contexts where two or more writing traditions co-exist. We found that an ensemble model trained with augmented data from various error types achieves better overall scores for most error types and languages. Finally, we released our spell-checkers as a web service that indigenous communities and organisations can use to develop future language materials.
UR - https://www.scopus.com/pages/publications/105027171685
U2 - 10.18653/v1/2022.aacl-short.51
DO - 10.18653/v1/2022.aacl-short.51
M3 - Conference contribution
AN - SCOPUS:105027171685
T3 - Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Long Paper, AACL-IJCNLP 2022
SP - 511
EP - 517
BT - Student Research Workshop
A2 - Yan, Hanqi
A2 - Yang, Zonghan
A2 - Ruder, Sebastian
A2 - Wan, Xiaojun
PB - Association for Computational Linguistics (ACL)
Y2 - 20 November 2022 through 23 November 2022
ER -