A Low-Resourced Peruvian Language Identification Model

Alexandra Espichán Linares, Arturo Oncevay-Marcos

Research output: Contribution to journal, Conference article, peer-reviewed

Abstract

Owing to the linguistic revitalization under way in Perú in recent years, there is growing interest in strengthening bilingual education in the country and in expanding research on its native languages. From a computer science perspective, one of the first steps in supporting the study of these languages is to implement an automatic language identification tool using machine learning methods. This work therefore focuses on two steps: (1) building a digital, annotated corpus for 16 Peruvian native languages from documents found in web repositories, and (2) fitting a supervised learning model for the language identification task, using features drawn from related state-of-the-art studies, such as n-grams. The results were promising (97% average precision), and the corpus and the model are expected to support more complex tasks in the future.
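
As an illustration of step (2), the following is a minimal sketch of language identification from character n-gram features. It is not the authors' implementation: the abstract does not name the classifier, so a multinomial Naive Bayes with add-one smoothing is assumed here, and the names `char_ngrams` and `NgramNaiveBayes` are hypothetical. The training sentences are toy English and Spanish stand-ins, not samples from the 16-language corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return character n-grams of length n_min..n_max from lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts (assumed model, add-one smoothing)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.doc_counts = Counter(labels)    # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of every n-gram in the input.
            score = math.log(self.doc_counts[label] / n_docs)
            for g in grams:
                score += math.log(
                    (self.counts[label][g] + 1) / (self.totals[label] + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stand-in data (hypothetical; not taken from the paper's corpus).
train_texts = [
    "the house is near the river",
    "we are going to the market today",
    "la casa está cerca del río",
    "hoy vamos al mercado del pueblo",
]
train_labels = ["en", "en", "es", "es"]

model = NgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("the river is near the market"))
print(model.predict("el mercado está cerca"))
```

Character n-grams are a natural fit for low-resource settings because they need no tokenizer or dictionary. With a real corpus, the same features could be fed to any supervised classifier and scored with per-language precision.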

Original language: English
Pages (from-to): 57-63
Number of pages: 7
Publication: CEUR Workshop Proceedings
Volume: 2029
Status: Published - 2017
Event: 4th Annual International Symposium on Information Management and Big Data, SIMBig 2017 - Lima, Perú
Duration: 4 Sep 2017 to 6 Sep 2017
