Deep Video Understanding: A Stepping Stone Towards AGI

2024-09-27

[Abstract]

Video has become one of the most widely consumed and produced modalities today. However, building AI systems that deeply understand videos remains a challenging goal due to the difficulty of annotation, the sheer volume of data, and the substantial computational burden of training and inference for video models. To address these problems, I introduce new strategies for pre-training and fine-tuning video foundation models, including parameter-efficient fine-tuning (PEFT). Additionally, to deploy video models to users, I present training-free, cost-efficient inference techniques for video transformers. To demonstrate the generalizability of video foundation models, I highlight our recent work on ‘Video Question Answering’, which implicitly requires tackling various subtasks and achieving a deeper understanding of videos. Lastly, I discuss how Video QA and Multimodal QA systems can serve as stepping stones towards artificial general intelligence and outline future research directions.
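As background for the abstract (and not the specific methods presented in the talk), the sketch below illustrates one common flavor of parameter-efficient fine-tuning, bottleneck adapters attached to a frozen backbone, in PyTorch. The `Adapter` module, the toy backbone, and the placement of the adapters are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features.
        return x + self.up(self.act(self.down(x)))

def freeze_and_attach_adapters(backbone: nn.Module, dim: int) -> nn.ModuleList:
    """Freeze all backbone weights and return one trainable adapter per top-level block."""
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.ModuleList(Adapter(dim) for _ in backbone.children())

# Toy stand-in for a video transformer backbone (hypothetical).
backbone = nn.Sequential(nn.Linear(256, 256), nn.Linear(256, 256))
adapters = freeze_and_attach_adapters(backbone, dim=256)

# Only the adapters (a small fraction of the parameters) are optimized.
optimizer = torch.optim.AdamW(adapters.parameters(), lr=1e-4)
```

Because only the adapters receive gradients, fine-tuning updates a small fraction of the parameters, which is what makes PEFT attractive for large video foundation models.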

[Biography]

Hyunwoo J. Kim is an associate professor at Korea University, where he leads the Machine Learning and Vision Lab (MLV). His lab develops techniques for general-purpose AI systems, including multimodal foundation models, multimodal question answering, efficient inference, and new neural network architectures. Prior to this position, he worked at Amazon Lab126 in Sunnyvale, California. He obtained his Ph.D. in Computer Sciences from the University of Wisconsin-Madison, with a Ph.D. minor in Statistics. He served as an Area Chair for CVPR 2024 and recently co-organized the 1st and 2nd MICCAI workshops on Foundation Models for General Medical AI in 2023 and 2024.
