TY - GEN
T1 - BPE vs. Morphological Segmentation
T2 - 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
AU - Mager, Manuel
AU - Oncevay, Arturo
AU - Mager, Elisabeth
AU - Kann, Katharina
AU - Vu, Ngoc Thang
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.
AB - Morphologically-rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.
UR - http://www.scopus.com/inward/record.url?scp=85139897186&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85139897186
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 961
EP - 971
BT - ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Findings of ACL 2022
A2 - Muresan, Smaranda
A2 - Nakov, Preslav
A2 - Villavicencio, Aline
PB - Association for Computational Linguistics (ACL)
Y2 - 22 May 2022 through 27 May 2022
ER -