How Many Labels Do We Need to Understand Pixels?
Dense prediction is a fundamental class of computer vision problems in which the goal is to predict a label for every pixel of an input image. Since any problem that relates pixels to labels falls into this class, it broadly encapsulates the majority of vision tasks, including semantic segmentation, object detection, pose estimation, and depth estimation, to name a few. Despite remarkable progress, however, training a model for dense prediction remains challenging due to the cost of collecting per-pixel labels. A more desirable approach is to build a few-shot learner for dense prediction, yet current solutions are limited to specific tasks such as segmentation.
This talk will discuss whether it is possible to build a universal few-shot learner that can learn arbitrary, unseen dense prediction tasks from a few labeled images (e.g., ten). The first part of the talk will cover the challenges and desiderata of building such a learner. Then I will present our approach based on non-parametric matching and demonstrate that it encapsulates all dense prediction tasks in principle and produces promising results on real-world data. Finally, I will conclude with some limitations of our approach and promising directions for future research.
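To give a flavor of the matching idea, the sketch below shows one common way non-parametric matching can be applied to dense prediction; it is an illustrative toy, not the speaker's actual model, and the function, feature encoder, and temperature parameter are all assumptions. Each query pixel's label is predicted as a similarity-weighted average of the support pixels' labels, which is why the same mechanism applies in principle to any per-pixel output (segmentation masks, depth maps, keypoint heatmaps, and so on).

```python
import numpy as np

def matching_predict(support_feats, support_labels, query_feats, tau=0.1):
    """Toy non-parametric matching for dense prediction.

    support_feats:  (N, D) features of N support pixels (stand-in for a
                    learned encoder's output)
    support_labels: (N, C) per-pixel labels with C channels; the same code
                    works for class scores, depth values, or heatmaps
    query_feats:    (M, D) features of M query pixels
    Returns (M, C) predicted labels for the query pixels.
    """
    # Cosine similarity between every query pixel and every support pixel.
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = q @ s.T                        # (M, N)
    # A softmax over support pixels turns similarities into matching weights.
    w = np.exp(sim / tau)
    w /= w.sum(axis=1, keepdims=True)
    # Each query label is a weighted average of the support labels.
    return w @ support_labels            # (M, C)
```

Note that nothing in this sketch is task-specific: swapping the label channels swaps the task, while the matching rule stays fixed, which is the sense in which such a learner could be "universal."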
Seunghoon Hong is an assistant professor at the School of Computing, KAIST. Before joining KAIST, he was a postdoctoral fellow at the University of Michigan and a visiting research faculty member on the Google Brain team. His research interests lie at the intersection of machine learning and computer vision, with a specific focus on learning with minimal supervision and deep generative models. He received his B.S. and Ph.D. degrees from the Department of Computer Science and Engineering at POSTECH, Pohang, Korea, in 2011 and 2017, respectively.