Morphological segmentation of low-resource Indonesian dialects using unsupervised neural models

Muhammad Iqbal Ibrahim

Authors

Muhammad Iqbal Ibrahim, Universitas Negeri Yogyakarta

Keywords:

Bayesian neural models, character-level modeling, low-resource languages, morphological segmentation, Indonesian dialects

Abstract

Background: Indonesia’s linguistic diversity, with many underrepresented dialects, poses challenges for morphological analysis under low-resource conditions. Objective: This study examines whether unsupervised neural models can learn morphological structure in Indonesian dialects without annotated data. Method: A comparative unsupervised approach was applied using probabilistic segmentation, character-level BiLSTM, and Bayesian neural models on raw dialectal corpora. Results: Probabilistic methods capture frequent roots and affixes but struggle with reduplication and clitics; character-level models better handle phonological variation, while Bayesian models achieve the most balanced performance with stronger coherence and cross-dialect generalization. Implication: These findings highlight the potential of unsupervised, probabilistic approaches to support inclusive language technology for low-resource dialects. Novelty: This study reconceptualizes dialectal morphology as a latent probabilistic system and demonstrates the effectiveness of Bayesian unsupervised neural segmentation.
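To make the probabilistic segmentation mentioned in the abstract more concrete, the following is a minimal sketch of the decoding step only: given morph probabilities, it finds the most probable split of a word with a Viterbi search. It is not the study's implementation. The names `MORPH_PROBS`, `UNKNOWN_CHAR_PROB`, and `segment`, and all probability values, are illustrative assumptions; in an unsupervised setting the probabilities would be estimated from raw dialectal text rather than written by hand.

```python
import math

# Hypothetical toy morph probabilities (assumed values for illustration).
# A real unsupervised model would estimate these from raw text,
# for example with EM or Gibbs sampling, instead of hard-coding them.
MORPH_PROBS = {
    "mem": 0.05,
    "di": 0.05,
    "baca": 0.02,
    "makan": 0.02,
    "an": 0.05,
    "nya": 0.04,
}

# Assumed fallback probability for a single character not in the table,
# so that every word has at least one valid segmentation.
UNKNOWN_CHAR_PROB = 1e-4


def segment(word, probs=MORPH_PROBS, max_len=8):
    """Return the highest-probability split of `word` into morphs (Viterbi)."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best log-probability of word[:i]
    back = [0] * (n + 1)          # start index of the last morph in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in probs:
                score = math.log(probs[piece])
            elif len(piece) == 1:
                score = math.log(UNKNOWN_CHAR_PROB)
            else:
                continue  # unknown multi-character pieces are not allowed
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the word to recover the morphs.
    morphs = []
    end = n
    while end > 0:
        start = back[end]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]
```

With these toy values, `segment("membaca")` returns `["mem", "baca"]` and `segment("makanan")` returns `["makan", "an"]`. The sketch treats a word as a flat sequence of independent morphs, which is one reason simple models of this kind can be expected to struggle with reduplication and clitics.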

