Abstract. This paper presents a rule-based approach to the morphological analysis and generation of Kazakh, a highly agglutinative, morphologically complex language. Because Kazakh relies heavily on affixation and on phonological alternations such as vowel harmony and consonant mutation, computational modeling of its morphology requires a precise and systematic approach. The core technology is finite-state transducers (FSTs), which offer both formal rigor and computational efficiency for capturing the regular patterns of word formation. The system has two main components: a morphological analyzer, which segments surface word forms into a root and affixes with their associated grammatical features, and a morphological generator, which produces well-formed surface forms from abstract morphological representations. The FST architecture encodes morphotactic rules, phonological constraints, and affix ordering for nominal and verbal paradigms, covering tense, mood, aspect, person, number, and case. To support the transducer-based analysis, a comprehensive lexicon of Kazakh lemmas is compiled and organized by part of speech. The hand-crafted morphological rules cover both inflectional and derivational morphology and reflect the linguistic structure of the language. Evaluation on a manually annotated corpus of contemporary Kazakh texts shows high accuracy in both analysis and generation.
The resulting tool is a foundational component for downstream natural language processing applications such as part-of-speech tagging, syntactic parsing, and machine translation. The system is released as an open-source module to encourage wider use and further research in Kazakh computational linguistics, and it contributes to the development of language technology for low-resource languages.
Keywords: Kazakh language, morphological analysis, morphological generation, finite-state transducers, agglutinative languages, natural language processing, rule-based systems.
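
As a rough illustration of the analyzer/generator pairing described in the abstract, the sketch below handles a single affix, the Kazakh nominal plural, in plain Python. It is not the system's FST implementation. The function names, the `+N+Pl` tag format, and the tiny lexicon are hypothetical, and the rules are simplified: harmony is read from the last unambiguous vowel of the stem (defaulting to back when there is none), and loanwords and exceptional stems are ignored.

```python
# Toy rule-based plural generation and analysis for Kazakh nouns.
# Assumption: simplified rules for illustration only, not the paper's FST.

BACK_VOWELS = set("аоұы")
FRONT_VOWELS = set("әеөүі")
# Stem-final segments that select the suffix-initial consonant.
L_TRIGGERS = BACK_VOWELS | FRONT_VOWELS | set("рйуи")  # -лар / -лер
D_TRIGGERS = set("лмнңжз")                             # -дар / -дер
# Any other final consonant selects -тар / -тер.


def is_front(stem: str) -> bool:
    """Vowel harmony: decide by the last unambiguous vowel of the stem."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS:
            return True
        if ch in BACK_VOWELS:
            return False
    return False  # assumption: default to back harmony


def generate_plural(stem: str) -> str:
    """Generation: abstract form (stem + plural) to surface form."""
    last = stem[-1]
    if last in L_TRIGGERS:
        initial = "л"
    elif last in D_TRIGGERS:
        initial = "д"
    else:
        initial = "т"
    vowel = "е" if is_front(stem) else "а"
    return stem + initial + vowel + "р"


def analyze_plural(surface: str, lexicon: list[str]) -> list[str]:
    """Analysis: surface form to tagged analyses, by inverting generation."""
    return [f"{stem}+N+Pl" for stem in lexicon
            if generate_plural(stem) == surface]


if __name__ == "__main__":
    lexicon = ["бала", "кітап", "қыз", "үй"]  # hypothetical mini-lexicon
    for stem in lexicon:
        print(stem, "->", generate_plural(stem))
    print(analyze_plural("қыздар", lexicon))
```

Analysis here simply inverts generation over a lexicon. A finite-state transducer provides the same bidirectionality by construction and scales to full affix chains, which this sketch does not attempt.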