BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

Research output: Chapter in book/report/conference proceeding; Conference contribution; Peer-reviewed

12 Citations (Scopus)

Abstract

Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.
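For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of the generic byte-pair encoding idea: repeatedly merge the most frequent adjacent symbol pair in a word-frequency list, then replay the learned merges to segment new words. It is not the authors' pipeline. The function names (`learn_bpe`, `merge_pair`, `segment`), the toy English word counts, and the number of merges are all assumptions chosen for illustration, and the end-of-word marker used by common BPE implementations is omitted for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Each word starts as a sequence of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy data (not taken from the paper's corpora).
TOY_COUNTS = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(TOY_COUNTS, num_merges=3)
```

With this toy data, `segment("lowest", merges)` returns `['lo', 'w', 'est']`. The merges are driven purely by frequency, so the resulting units need not coincide with morpheme boundaries, which is the contrast with morphological segmentation that the abstract draws.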

Original language: English
Title of host publication: ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Findings of ACL 2022
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Publisher: Association for Computational Linguistics (ACL)
Pages: 961-971
Number of pages: 11
ISBN (electronic): 9781955917254
Publication status: Published - 2022
Externally published
Event: 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland
Duration: 22 May 2022 to 27 May 2022

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (print): 0736-587X

Conference

Conference: 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
Country/Territory: Ireland
City: Dublin
Period: 22/05/22 to 27/05/22
