6th International Conference on Problems of Cybernetics and Informatics, PCI 2025, Baku, Azerbaijan, 25-28 August 2025 (Full-Text Paper)
Large language models (LLMs) achieve high accuracy on natural language processing tasks, but their memory footprint and computational cost limit their portability and efficiency, restricting their applicability on edge devices and in other low-resource environments. This study proposes a quantization method inspired by biological neural systems to reduce the memory usage and computational cost of LLMs. The method determines channel importance based on Spike-Timing-Dependent Plasticity (STDP) principles and adaptively quantizes the weights at two precision levels (INT8 and FP32). In experiments on the GPT-2-medium model, the proposed STDP-inspired adaptive top-k quantization reduced the size of the quantized multi-layer perceptron (MLP) layers by 57%, yielding a 33.3% reduction in overall model size. At the same time, no significant degradation in language modeling performance was observed; on the contrary, small improvements were achieved in perplexity and accuracy. Moreover, throughput (FLOPs/s) and latency remained close to those of the original model. These results show that STDP-based adaptive quantization offers a strong alternative to conventional methods in terms of both efficiency and accuracy.
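To make the described scheme concrete, the following is a minimal PyTorch sketch of mixed-precision top-k quantization: channels ranked most important stay in FP32 while the rest are quantized to INT8. The importance function shown here (a correlation of pre- and post-activations, loosely in the spirit of STDP) and the names `stdp_importance`, `adaptive_topk_quantize`, and `k_ratio` are illustrative assumptions, not the paper's exact rule.

```python
import torch

def stdp_importance(pre_act: torch.Tensor, post_act: torch.Tensor) -> torch.Tensor:
    # Hypothetical STDP-style score: strength of pre/post activation
    # correlation per output channel. The paper's exact plasticity
    # rule is not specified in the abstract.
    return (post_act.T @ pre_act).abs().sum(dim=1)

def adaptive_topk_quantize(weight: torch.Tensor,
                           importance: torch.Tensor,
                           k_ratio: float = 0.5):
    """Keep the top-k most important output channels in FP32 and
    quantize the remaining channels to symmetric per-channel INT8."""
    out_channels = weight.shape[0]
    k = int(out_channels * k_ratio)
    keep_mask = torch.zeros(out_channels, dtype=torch.bool)
    keep_mask[torch.topk(importance, k).indices] = True

    quantized = weight.clone()
    low_rows = weight[~keep_mask]
    # Symmetric INT8 quantization of the less important channels.
    scale = low_rows.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    int8_rows = torch.clamp((low_rows / scale).round(), -128, 127)
    quantized[~keep_mask] = int8_rows * scale  # store dequantized values
    return quantized, keep_mask
```

In this sketch only the unimportant rows lose precision, so the memory saving scales with `1 - k_ratio`; the reported 57% MLP-layer reduction would correspond to quantizing a large fraction of channels while a small FP32 subset preserves accuracy.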