Optimizing Attention for Efficient LLM Inference: A Review
Sun, Siyuan; Yu, Jinling; Liu, Hanshuo; Guo, Hanyun; Cao, Yang; Zhang, Shouhua; Zhou, Jiehan (2025-06-13)
S. Sun et al., "Optimizing Attention for Efficient LLM Inference: A Review," 2025 8th World Conference on Computing and Communication Technologies (WCCCT), Shenzhen, China, 2025, pp. 482-491, doi: 10.1109/WCCCT65447.2025.11027973
https://rightsstatements.org/vocab/InC/1.0/
© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The persistent address of this publication is
https://urn.fi/URN:NBN:fi:oulu-202507015041
Abstract
The rapid advancement of deep learning has led to significant progress in large language models (LLMs), with the Attention mechanism serving as a core component of their success. However, the computational and memory demands of Attention mechanisms pose bottlenecks for efficient inference, especially in long-sequence and real-time tasks. This paper systematically reviews optimization strategies for Attention mechanisms, including sparse attention, low-rank decomposition, quantization techniques, block-based parallel computation, and memory management. These approaches have demonstrated notable improvements in reducing computational complexity, optimizing memory usage, and enhancing inference performance. This review highlights the key challenges of computational efficiency, long-sequence modeling, and cross-task generalization through an in-depth analysis of existing methods, their advantages, and limitations. Future research directions, including dynamic precision, hardware-aware optimization, and lightweight architectures, offer insights for advancing LLM inference theory and practice.
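To make the sparse-attention idea mentioned in the abstract concrete, the following minimal sketch (illustrative only; the function name, the NumPy implementation, and the window size are assumptions, not material from the reviewed paper) restricts each query to a fixed causal window of recent keys, so the attention cost scales with O(n·w) rather than O(n²) in sequence length n.

# Illustrative sketch only; not code from the reviewed paper.
# Sliding-window (causal) sparse attention: each query attends to at most
# `window` preceding positions, shrinking the score matrix from n*n to n*window.
import numpy as np

def sliding_window_attention(q, k, v, window=64):
    """q, k, v: arrays of shape (seq_len, d); returns an array of shape (seq_len, d)."""
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)                  # causal local window [lo, i]
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)   # scaled dot-product scores
        weights = np.exp(scores - scores.max())      # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:i + 1]               # weighted sum of local values
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 512, 64
    q, k, v = (rng.standard_normal((n, d)).astype(np.float32) for _ in range(3))
    print(sliding_window_attention(q, k, v, window=64).shape)  # (512, 64)

The same locality idea underlies block-based variants, which process such windows as contiguous tiles to improve memory access patterns on accelerators.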
Collections
- Open access [38841]