Abstract. The growing volume of spam in digital communication creates an urgent need for effective spam detection systems, particularly for languages with limited digital resources, such as Kazakh. This research develops a machine learning approach to spam detection in Kazakh messages, using text preprocessing and TF-IDF feature extraction to improve model performance.
The primary objective of this study is to evaluate how effectively the Multinomial Naive Bayes algorithm classifies spam and non-spam messages in a dataset of 200 manually labeled samples. The methodology comprises three steps: data collection, preprocessing to clean and normalize the text, and feature extraction to convert the messages into a numerical representation suitable for classification.
The findings show that the proposed model achieves an accuracy of 95%, demonstrating its potential for spam detection in the Kazakh language. This work helps close the gap in spam detection resources designed for the Kazakh-speaking community. In practical terms, the results can inform the development of more sophisticated spam filtering systems, improving user experience and security in digital communication. The theoretical significance of the study lies in its contribution to natural language processing and machine learning, encouraging further research on algorithms and techniques applicable to underrepresented languages. The study also outlines the text processing steps that improve spam detection accuracy in Kazakh messages by helping machine learning models identify patterns.
Keywords: spam detection, TF-IDF, Multinomial Naive Bayes, Kazakh language, spam prediction, machine learning.
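
The pipeline summarized above (preprocessing, feature extraction, Multinomial Naive Bayes classification) can be illustrated with a minimal sketch in Python. It is not the study's implementation: the four training messages are invented toy examples rather than samples from the 200-message dataset, the class and function names are hypothetical, and raw token counts with Laplace smoothing are assumed in place of TF-IDF weights to keep the example dependency-free.

```python
import math
import re
from collections import Counter, defaultdict


def preprocess(text):
    """Lowercase the message and split it into word tokens.

    The \\w pattern is Unicode-aware in Python 3, so Kazakh Cyrillic
    letters are kept while punctuation is dropped.
    """
    return re.findall(r"\w+", text.lower())


class ToyMultinomialNB:
    """Illustrative Multinomial Naive Bayes over raw token counts (assumption: no TF-IDF)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for message, label in zip(messages, labels):
            self.token_counts[label].update(preprocess(message))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, message):
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in preprocess(message) if t in self.vocab]
        n_messages = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.token_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n / n_messages)
            for t in tokens:
                score += math.log((self.token_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy messages (assumption), labeled by hand for illustration only.
train_messages = [
    "Сәлем, қалайсың?",
    "Ертең кездесеміз бе?",
    "Ұтыс ұтып алдыңыз! Сілтемені басыңыз",
    "Тегін сыйлық алу үшін сілтемені басыңыз",
]
train_labels = ["ham", "ham", "spam", "spam"]

model = ToyMultinomialNB().fit(train_messages, train_labels)
print(model.predict("Сыйлық алу үшін сілтемені басыңыз"))
print(model.predict("Сәлем, ертең кездесеміз бе?"))
```

Working in log space avoids numerical underflow when many token probabilities are multiplied, and the smoothing constant keeps a token that appears in only one class from driving the other class's probability to zero.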