Abstract
Language modelling and machine translation tasks mostly use subword or character inputs, but syllables are seldom used. Syllables provide shorter sequences than characters, require less-specialised extraction rules than morphemes, and their segmentation is not affected by corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy. With a comparable perplexity, we show that syllables outperform characters and other subwords. Moreover, we study the importance of syllables in neural machine translation for an unrelated, low-resource language pair (Spanish–Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform unsupervised subwords, and further morphological segmentation methods, when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we perform a human evaluation and discuss limitations and opportunities.
| Original language | English |
|---|---|
| Pages (from-to) | 4258-4267 |
| Number of pages | 10 |
| Publication | Proceedings - International Conference on Computational Linguistics, COLING |
| Volume | 29 |
| Issue | 1 |
| Status | Published - 2022 |
| Event | 29th International Conference on Computational Linguistics, COLING 2022 - Hybrid, Gyeongju, Republic of Korea. Duration: 12 Oct 2022 → 17 Oct 2022 |
Fingerprint
Explore the research topics of 'Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation'. Together they form a unique fingerprint.