Code-mixing detection in Indonesian–English tweets using machine learning and linguistic features
Keywords:
code-mixing detection, linguistic features, machine learning, multilingual NLP, Twitter data
Abstract
Background: The increasing prevalence of Indonesian–English code-mixing in social media reflects sociolinguistic shifts driven by globalization and challenges language processing systems that assume monolingual input. Objective: This study evaluates how effectively machine learning models detect Indonesian–English code-mixing in Twitter data and how far linguistic features improve accuracy. Method: A supervised approach was applied using SVM and Random Forest classifiers on annotated tweets, enriched with features such as part-of-speech patterns, token-level language identification, and morphological markers. Results: SVM models outperformed baselines with high accuracy and balanced precision and recall, and linguistic features significantly enhanced detection, especially for intra-word mixing; errors arose mainly from lexical borrowing, short contexts, and morphologically integrated forms. Implication: These findings emphasize the importance of integrating linguistic knowledge into computational models to improve robustness in multilingual and low-resource settings. Novelty: This study demonstrates that linguistically informed machine learning frameworks enhance both performance and interpretability in detecting Indonesian–English code-mixing.
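To make the feature types named in the Method more concrete, the following is a minimal sketch of how token-level language identification and morphological markers of intra-word mixing might be turned into per-tweet features. It is not the study's implementation: the word lists, the affix lists, and the function names (`candidate_stems`, `tag_token`, `extract_features`) are all hypothetical and chosen only for illustration. A real system would use large lexicons or a trained token-level tagger.

```python
import re

# Hypothetical toy lexicons (assumption): stand-ins for real word lists.
ID_WORDS = {"aku", "lagi", "banget", "udah", "sama", "nanti", "yang", "di"}
EN_WORDS = {"busy", "file", "download", "meeting", "post", "update", "deadline"}

# Illustrative subset of Indonesian affixes that attach to English stems.
ID_PREFIXES = ("nge", "di", "ke", "me", "ter")
ID_SUFFIXES = ("nya", "kan", "in", "an")

# Lowercase alphabetic tokens, keeping internal hyphens (e.g. "di-download").
TOKEN_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def candidate_stems(token):
    """Yield stems left after stripping one Indonesian prefix and/or suffix."""
    bare = token.replace("-", "")
    for prefix in ("",) + ID_PREFIXES:
        if not bare.startswith(prefix):
            continue
        rest = bare[len(prefix):]
        for suffix in ("",) + ID_SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)] if suffix else rest
            # Skip the empty stem and the case where nothing was stripped.
            if stem and stem != bare:
                yield stem


def tag_token(token):
    """Tag one lowercase token as 'id', 'en', 'mixed' (intra-word), or 'other'."""
    if token in ID_WORDS:
        return "id"
    if token in EN_WORDS:
        return "en"
    # Morphological marker: Indonesian affix attached to an English stem.
    if any(stem in EN_WORDS for stem in candidate_stems(token)):
        return "mixed"
    return "other"


def extract_features(tweet):
    """Return simple count features for one tweet."""
    tags = [tag_token(t) for t in TOKEN_RE.findall(tweet.lower())]
    # Language switches are counted only between plain 'id' and 'en' tokens.
    langs = [t for t in tags if t in ("id", "en")]
    return {
        "n_tokens": len(tags),
        "n_id": tags.count("id"),
        "n_en": tags.count("en"),
        "n_intra_word": tags.count("mixed"),
        "n_switches": sum(a != b for a, b in zip(langs, langs[1:])),
    }


if __name__ == "__main__":
    print(extract_features("Aku lagi busy banget, filenya udah di-download"))
```

In a supervised setup such as the one described, a feature dictionary like this would be vectorized and passed to a classifier alongside other features. The affix-stripping rule is deliberately naive and would mislabel established loanwords, which is consistent with the error sources the abstract reports for lexical borrowing and morphologically integrated forms.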
Copyright (c) 2026 Arif Nugroho (Author)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.