Abstract. Automatic speech recognition is a rapidly developing field of machine learning. The most popular speech recognition systems today are based on an integrated (end-to-end) architecture, especially online end-to-end models, which directly output a sequence of words from the input audio in real time. Streaming speech recognition passes an audio stream to speech-to-text conversion and returns the recognition results in real time as the audio is processed. In this article, a popular model based on the recurrent neural network transducer (RNN-T) is considered and implemented for Kazakh speech recognition. Related work on Kazakh speech recognition based on the connectionist temporal classification (CTC) model is also reviewed. The results show that the RNN-T-based model can work well without additional components such as a language model, and that it achieved the best result on our dataset. The system reached a character error rate (CER) of 10.6%, which is the best result among integrated systems for Kazakh speech recognition.
Keywords: Automatic speech recognition, end-to-end, RNN-T, CTC, sequence-to-sequence.