

Deep Video Understanding: A Stepping Stone Towards AGI

Prof. Hyunwoo Kim (Korea University) / 2024.10.11


Video has become one of the most widely consumed and produced modalities. However, developing AI systems that deeply understand videos remains challenging because of the difficulty of annotation, the sheer volume of data, and the substantial computational cost of training and running video models. To address these problems, I introduce new strategies for pre-training and fine-tuning video foundation models, including parameter-efficient fine-tuning (PEFT). Additionally, to make video models practical to deploy to users, I present training-free, cost-efficient inference techniques for video transformers. To demonstrate the generalizability of video foundation models, I highlight our recent work on Video Question Answering (Video QA), which implicitly requires tackling various subtasks and achieving a deeper understanding of videos. Lastly, I discuss how Video QA and Multimodal QA systems can serve as stepping stones towards artificial general intelligence, and outline future research directions.


Hyunwoo J. Kim is an associate professor at Korea University, where he leads the Machine Learning and Vision Lab (MLV). His lab focuses on developing techniques for general-purpose AI systems, including multimodal foundation models, multimodal question answering, efficient inference, and new neural network architectures. Prior to this position, he worked at Amazon Lab126 in Sunnyvale, California. He obtained a Ph.D. in Computer Sciences from the University of Wisconsin-Madison (with a Ph.D. minor in statistics). He served as an Area Chair for CVPR 2024 and recently co-organized the 1st and 2nd MICCAI workshops on Foundation Models for General Medical AI in 2023 and 2024.
