Assessing back-translation as a corpus generation strategy for non-English tasks: A study in reading comprehension and word sense disambiguation

Fabricio Monsalve, Kervy Rivas-Rojas, Marco Antonio Sobrevilla Cabezudo, Arturo Oncevay

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Expert-curated corpora have sustained Natural Language Processing mainly in English, but the high cost of corpus creation is a barrier to development in other languages. We therefore propose a corpus generation strategy that requires only a machine translation system between English and the target language in both directions, and we filter the best translations by computing automatic translation metrics and the task performance score. By studying Reading Comprehension in Spanish and Word Sense Disambiguation in Portuguese, we found that a more quality-oriented metric has high potential for corpus selection without degrading task performance. We conclude that it is possible to systematise the building of quality corpora using machine translation and automatic metrics, along with some prior effort to clean and process the data.
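
The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the translators are passed in as hypothetical callables (`to_target`, `to_english`), the metric is a toy unigram F1 standing in for whichever automatic translation metric is actually used, and the threshold is an arbitrary assumption. The task performance score that the abstract also mentions is not modelled here.

```python
from collections import Counter
from typing import Callable, List, Tuple


def unigram_f1(reference: str, hypothesis: str) -> float:
    """Toy stand-in for an automatic translation metric (assumption, not BLEU)."""
    ref = Counter(reference.lower().split())
    hyp = Counter(hypothesis.lower().split())
    # Multiset intersection counts each shared token at most min(ref, hyp) times.
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_by_round_trip(
    english: List[str],
    to_target: Callable[[str], str],   # hypothetical English -> target translator
    to_english: Callable[[str], str],  # hypothetical target -> English translator
    threshold: float,                  # arbitrary cut-off, chosen for illustration
) -> List[Tuple[str, str, float]]:
    """Keep target-language sentences whose back-translation stays close to the source."""
    kept = []
    for source in english:
        translated = to_target(source)
        back = to_english(translated)
        score = unigram_f1(source, back)
        if score >= threshold:
            kept.append((source, translated, score))
    return kept
```

In practice the two callables would wrap a real machine translation system, and the toy metric would be swapped for an established one; the structure of the loop stays the same.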

Original language: English
Title of host publication: LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 81-89
Number of pages: 9
ISBN (Electronic): 9781950737383
State: Published - 2019
Externally published: Yes
Event: 13th Linguistic Annotation Workshop, LAW 2019, held in conjunction with the Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy
Duration: 1 Aug 2019 → …

Publication series

Name: LAW 2019 - 13th Linguistic Annotation Workshop, Proceedings of the Workshop

Conference

Conference: 13th Linguistic Annotation Workshop, LAW 2019, held in conjunction with the Annual Meeting of the Association for Computational Linguistics, ACL 2019
Country/Territory: Italy
City: Florence
Period: 1/08/19 → …
