Architecture and Systems for Distributed Deep Learning


How many GPUs do you use for training deep learning models? The answer is usually one, and rarely more, even when several GPUs are at your disposal in the lab server room. It turns out that utilizing multiple GPUs is not so simple, and distributed training offers many research opportunities. I will start with some basics of distributed deep learning. Then, I will introduce my recent research on system and architectural optimizations for it. The talk will cover issues at scales ranging from a single server [1,2,3] to a few servers [4] and to hundreds of servers [5].
[1] GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent, HPCA 2021
[2] Pipe-BD: Pipelined Parallel Blockwise Distillation, DATE 2023
[3] Fast Adversarial Training with Dynamic Batch-level Attack Control, DAC 2023
[4] FlexReduce: Flexible All-reduce for Distributed Deep Learning on Asymmetric Network Topology, DAC 2020
[5] Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression, ASPLOS 2023


Jinho Lee received his B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from Seoul National University, Seoul, South Korea, in 2009, 2011, and 2016, respectively.
He is currently an assistant professor in the Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea.
His current research interests include architectures and system optimizations for deep learning.