Assessing back-translation as a corpus generation strategy for non-English tasks: A study in reading comprehension and word sense disambiguation

Fabricio Monsalve, Kervy Rivas-Rojas, Marco Antonio Sobrevilla Cabezudo, Arturo Oncevay

Research output: Chapter in book/report/conference proceeding; Conference contribution; peer-reviewed

1 Citation (Scopus)

Abstract

Expert-curated corpora have sustained Natural Language Processing mainly in English, but the high cost of corpus creation is a barrier to development in other languages. We therefore propose a corpus generation strategy that requires only a machine translation system between English and the target language in both directions, and we filter the best translations by computing automatic translation metrics and the task performance score. By studying Reading Comprehension in Spanish and Word Sense Disambiguation in Portuguese, we found that a more quality-oriented metric has high potential for corpus selection without degrading task performance. We conclude that it is possible to systematise the building of quality corpora using machine translation and automatic metrics, alongside some prior effort to clean and process the data.
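
The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the metrics used, so a simple unigram-overlap F1 stands in for an automatic translation metric, and the task performance score is left out. The callables `to_target` and `to_english`, the function names, and the 0.7 threshold are all hypothetical placeholders for whatever machine translation system and cut-off one chooses.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Token-overlap F1 between two sentences.

    Assumption: a crude stand-in for a real automatic translation metric.
    """
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not ref_tokens or not hyp_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sentences.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def select_translations(
    sentences: Iterable[str],
    to_target: Callable[[str], str],   # hypothetical English -> target MT call
    to_english: Callable[[str], str],  # hypothetical target -> English MT call
    threshold: float = 0.7,            # assumed cut-off, not from the paper
) -> List[Tuple[str, str, float]]:
    """Keep translations whose round trip back to English stays close to the source."""
    kept = []
    for source in sentences:
        translated = to_target(source)
        round_trip = to_english(translated)
        score = unigram_f1(source, round_trip)
        if score >= threshold:
            kept.append((source, translated, score))
    return kept
```

Passing the translators in as callables keeps the sketch independent of any particular machine translation service; in practice the overlap score would be replaced by an established metric and combined with the downstream task score.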

Original language: English
Title of host publication: LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 81-89
Number of pages: 9
ISBN (electronic): 9781950737383
Status: Published - 2019
Externally published
Event: 13th Linguistic Annotation Workshop, LAW 2019, held in conjunction with the Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy
Duration: 1 Aug 2019 → …

Publication series

Name: LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop

Conference

Conference: 13th Linguistic Annotation Workshop, LAW 2019, held in conjunction with the Annual Meeting of the Association for Computational Linguistics, ACL 2019
Country/Territory: Italy
City: Florence
Period: 1/08/19 → …
