On December 11, in Astana, the national language model KAZ-LLM was presented to the President of Kazakhstan, Kassym-Jomart Tokayev. The model was developed under the guidance of the Institute of Smart Systems and Artificial Intelligence at Nazarbayev University (ISSAI NU) in partnership with Beeline Kazakhstan and its IT company QazCode, as well as Astana Hub. The project is coordinated by the Ministry of Digital Development, Innovations and Aerospace Industry of the Republic of Kazakhstan (MDDIAI RK). The model is strategically significant for the entire country because it uses AI to narrow the language gap.
How was the KAZ-LLM model developed?
KAZ-LLM from ISSAI was trained on 150 billion tokens, meticulously collected from publicly available sources in four languages: Kazakh, Russian, English, and Turkish. Tokens are the smallest units of text, such as words, parts of words, or even individual characters, that AI uses to analyze and understand information. The multilingual training data allows the model to demonstrate high accuracy and versatility, improving text processing quality across these languages and enhancing translation capabilities.
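To make the idea of tokens concrete, here is a minimal sketch of the three granularities mentioned above: words, parts of words, and individual characters. It is purely illustrative. The toy vocabulary and function names are hypothetical, and this is not the tokenizer that KAZ-LLM uses.

```python
def word_tokens(text):
    """Split text into word-level tokens on whitespace."""
    return text.split()


def char_tokens(text):
    """Split text into character-level tokens, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(word, vocab):
    """Greedy longest-match split of one word using a toy vocabulary.

    Any stretch not covered by the vocabulary falls back to single characters.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is in the vocabulary
        # or only one character is left.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, chosen only for this illustration.
TOY_VOCAB = {"token", "iz", "ation", "s"}

print(word_tokens("language models read tokens"))
print(subword_tokens("tokenization", TOY_VOCAB))
print(char_tokens("AI"))
```

Production models typically learn their subword vocabulary from the training corpus rather than using a hand-written one, which is why the size and language mix of that corpus matters.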
The interface and functionality of the KAZ-LLM model were developed in line with the most advanced global standards, confirming the model's high technological maturity and broad potential. Comprehensive benchmarks with question-answer pairs covering various fields of knowledge were used to evaluate its performance. The benchmark package included the following tests:
- ARC (AI2 Reasoning Challenge) — assessment of scientific reasoning through multiple-choice questions.
- GSM8K — evaluation of the ability to solve elementary school math problems.
- HellaSwag — testing of logic in sentence continuation.
- MMLU (Massive Multitask Language Understanding) — assessment of knowledge across 57 different subjects.
- Winogrande — evaluation of common sense in ambiguous sentences.
- DROP — testing of reading comprehension and logical reasoning skills.
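Several of the benchmarks above are scored as multiple-choice accuracy: the model picks one option per question, and the score is the share of questions answered correctly. The following is a minimal sketch of that scoring step, assuming a simple item format. The sample items, the `predict` callback, and the function names are hypothetical and do not represent the evaluation harness used for KAZ-LLM.

```python
def accuracy(items, predict):
    """Return the fraction of items where the predicted choice index is correct.

    items   -- list of dicts with "question", "choices" and "answer" (index)
    predict -- callable(question, choices) -> index of the chosen option
    """
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        1
        for item in items
        if predict(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)


# Hypothetical items in a multiple-choice style.
SAMPLE_ITEMS = [
    {"question": "2 + 2 = ?", "choices": ["4", "5"], "answer": 0},
    {"question": "Which is a gas at room temperature?",
     "choices": ["iron", "oxygen"], "answer": 1},
]


def always_first(question, choices):
    """Stand-in for a model: always picks the first option."""
    return 0


print(accuracy(SAMPLE_ITEMS, always_first))
```

In a real evaluation, `predict` would wrap a call to the language model, and the items would come from the published benchmark datasets.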
The partnership with Beeline Kazakhstan and QazCode accelerated development
The key partners in creating the model were Beeline Kazakhstan and its IT company QazCode, which brought experience from developing language models such as Kaz-RoBERTA and from building AI solutions for small language groups in collaboration with foreign partners. Their support, in the form of eight DGX H100 servers, significantly accelerated the training process and expanded the model's capabilities. For comparison, a regular computer would take several days to analyze an archive of 1 million photos, whereas the eight DGX H100 servers used to train ISSAI KAZ-LLM can complete this task in just a few seconds.
Using these servers, the developers trained two versions of the model, one with 8 billion parameters and one with 70 billion parameters, and data scientists from QazCode joined the process.
“Our team was actively involved in the development and training of the KAZ-LLM model. In creating the LLM, developers and partners utilized modern machine learning technologies such as PyTorch and Torchtune, while also considering the experience from previous projects aimed at adapting open-source LLM architectures for the Kazakh language. During the training, which lasted 50 days of continuous computations, the model improved its ability to understand context and provide high-quality interactions with users. Testing showed that the model effectively addresses technical tasks while taking into account the cultural and linguistic nuances of the Kazakh language,” said QazCode CEO Alexey Sharavar.
About the results and prospects of KAZ-LLM
Researchers note that the project represents an important milestone for Kazakhstan on the global artificial intelligence stage: “This model reflects Kazakhstan's commitment to innovation, independence, and the growth of its technological ecosystem. Our team prepared two versions of ISSAI KAZ-LLM with 8 billion and 70 billion parameters, built on the Meta Llama architecture and optimized for high-performance systems and resource-constrained environments. The models are released under the CC-BY-NC license, available for non-commercial use on the Hugging Face website, fostering global academic and research collaboration. Thus, developers will be able to download and run our model on both complex servers and laptops,” said ISSAI Director and NU Professor Khusein Atakan Varol.
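A rough calculation shows why the two model sizes suit such different hardware. The sketch below estimates the memory needed just to hold the model weights at a given numeric precision. The precisions shown (16-bit and 4-bit) are assumptions chosen for illustration and are not stated by the developers. The estimate also ignores activations and other runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


# Parameter counts from the article; precisions are illustrative assumptions.
MODEL_SIZES = {"8B": 8_000_000_000, "70B": 70_000_000_000}

for name, params in MODEL_SIZES.items():
    for bits in (16, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: about {gb:.0f} GB of weights")
```

Under these assumptions, the 8-billion-parameter version needs about 16 GB at 16-bit precision and about 4 GB at 4-bit precision, while the 70-billion-parameter version needs about 140 GB and 35 GB respectively.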
ISSAI KAZ-LLM is expected to open new opportunities for creating AI-based startups and innovative projects. Future plans include the development of next-generation models that integrate language and visual data, significantly expanding AI capabilities. Support for other Turkic languages is also under consideration, which would strengthen ties among Turkic-speaking communities.