Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

Vivek Iyer, Arturo Oncevay, Alexandra Birch

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer review

1 Citation (Scopus)

Abstract

Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a `base' NMT model. We conduct experiments on 3 different language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably or better than massive models like mBART50 and mRASP2, depending on the size of data provided. Lastly, through ablation studies, we highlight the major code-switching aspects (including context, many-to-many substitutions, code-switching language count, etc.) that contribute to the enhanced pretraining of multilingual NMT models.
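To make the limitation concrete, the following is a minimal sketch of the lexicon-based code-switched noising that the abstract describes as the prior approach: each source word found in a bilingual lexicon is swapped for its single fixed translation with some probability, with no regard for context. The lexicon entries, function name, and `ratio` parameter here are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Toy English->Spanish lexicon (hypothetical entries for illustration).
# Real pipelines typically use large bilingual dictionaries instead.
LEXICON = {"cat": "gato", "sat": "sentó", "mat": "estera"}

def lexicon_code_switch(sentence, lexicon, ratio=0.5, seed=0):
    """Non-contextual, one-to-one code-switched noising.

    Each whitespace token that appears in the lexicon is replaced by
    its single stored translation with probability `ratio`. Because the
    substitution ignores context, polysemes get a fixed translation
    regardless of sense, and multi-word expressions are split apart --
    exactly the noise sources the abstract says CCS addresses by using
    a base NMT model for contextual, many-to-many substitutions.
    """
    rng = random.Random(seed)  # fixed seed for reproducible noising
    out = []
    for word in sentence.split():
        if word in lexicon and rng.random() < ratio:
            out.append(lexicon[word])
        else:
            out.append(word)
    return " ".join(out)

# With ratio=1.0 every lexicon word is substituted:
# "the cat sat on the mat" -> "the gato sentó on the estera"
```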

Original language: English
Host publication title: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023
Publisher: Association for Computational Linguistics (ACL)
Pages: 954-968
Number of pages: 15
ISBN (electronic): 9781959429470
Status: Published - 2023
Published externally: Yes
Event: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023 - Dubrovnik, Croatia
Duration: 2 May 2023 - 6 May 2023

Publication series

Name: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023

Conference

Conference: 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Findings of EACL 2023
Country/Territory: Croatia
City: Dubrovnik
Period: 2/05/23 - 6/05/23

