Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Rodolfo Zevallos, Luis Camacho, Nelsi Melgarejo

Research output: Chapter in book/report/conference proceeding, Conference contribution, peer-reviewed

Abstract

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is designed primarily for developing automatic speech recognition, language identification, and text-to-speech tools. To collect the corpus sustainably, we use crowdsourcing. Huqariq includes four native languages of Peru and is expected to cover up to 20 of the country's 48 native languages by the end of 2022. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. To verify the quality of the corpus, we present speech recognition experiments using 220 hours of fully transcribed audio.

Original language: English
Title of host publication: 2022 Language Resources and Evaluation Conference, LREC 2022
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 5029-5034
Number of pages: 6
ISBN (electronic): 9791095546726
Status: Published - 2022
Event: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: 20 Jun 2022 to 25 Jun 2022

Publication series

Name: 2022 Language Resources and Evaluation Conference, LREC 2022

Conference

Conference: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/Territory: France
City: Marseille
Period: 20/06/22 to 25/06/22
