Seminar Announcement
Towards Cost-Effective LLM Serving Systems
- Posted: 2026.05.11

Seminar date: 2026.05.22 (Fri)

Speaker: Prof. Jeongseob Ahn (Korea University)
[Abstract]
Serving large language models (LLMs) in real-world applications such as chatbots and assistants poses critical challenges in both efficiency and cost. This talk presents two research directions that address these challenges at different layers of the system stack. I will begin with FlashGen, a framework designed for multi-turn dialogue workloads. Existing LLM serving frameworks suffer from excessive recomputation of attention keys and values (KVs) and from inefficient scheduling under long contexts. First, FlashGen tackles the computational inefficiency with a multi-level KV cache spanning GPU, CPU, and SSD. Second, I will introduce a scheduling technique that mitigates head-of-line blocking in long contexts. Together, these techniques allow FlashGen to significantly improve GPU utilization, yielding up to 2.85× throughput gains for large models such as Llama-2 70B while maintaining latency comparable to state-of-the-art systems. Next, I will discuss our prototype of a GPU-centric memory tiering system on the NVIDIA Grace Hopper Superchip, comprising a performance tier with GPU memory and a capacity tier with host memory. Unlike conventional offloading approaches such as FlexGen, which are primarily optimized for throughput-oriented offline batch processing, our design targets latency-critical serving. By jointly managing model parameters and KV tensors with pipelined offloading and a lossless compression scheme, we show that GPU-centric memory tiering can deliver a cost-effective solution for LLM serving.
[Biography]
Jeongseob Ahn is currently an associate professor in the School of Electrical Engineering (Computer Division) at Korea University. His research interests lie in building efficient computer systems. Before joining Korea University, he served as a faculty member at Ajou University for six and a half years. Prior to that, he worked at Oracle Labs, where he contributed to the development of a large-scale data analytics system with specialized hardware, known as RAPID. He also spent one year at the University of Michigan, Ann Arbor. He received his PhD in Computer Science from KAIST in 2015, and earned his BS in Computer Science and Engineering from Dongguk University in 2009.



