TY - GEN
T1 - Assessing back-translation as a corpus generation strategy for non-English tasks
T2 - 13th Linguistic Annotation Workshop, LAW 2019, held in conjunction with the Annual Meeting of the Association for Computational Linguistics, ACL 2019
AU - Monsalve, Fabricio
AU - Rivas-Rojas, Kervy
AU - Sobrevilla Cabezudo, Marco Antonio
AU - Oncevay, Arturo
N1 - Publisher Copyright: © 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
N2 - Corpora curated by experts have sustained Natural Language Processing mainly in English, but the expensiveness of corpora creation is a barrier for the development in further languages. Thus, we propose a corpus generation strategy that only requires a machine translation system between English and the target language in both directions, where we filter the best translations by computing automatic translation metrics and the task performance score. By studying Reading Comprehension in Spanish and Word Sense Disambiguation in Portuguese, we identified that a more quality-oriented metric has high potential in the corpora selection without degrading the task performance. We conclude that it is possible to systematise the building of quality corpora using machine translation and automatic metrics, besides some prior effort to clean and process the data.
UR - http://www.scopus.com/inward/record.url?scp=85084294944&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85084294944
T3 - LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop
SP - 81
EP - 89
BT - LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop
PB - Association for Computational Linguistics (ACL)
Y2 - 1 August 2019
ER -