Few-Shot Learning from Augmented Label-Uncertain Queries in Bongard-HOI
AAAI 2024

This paper addresses the recognition of human-object interactions (HOI) in a few-shot setting.

Abstract

Detecting human-object interactions (HOI) in a few-shot setting remains a challenge. Existing meta-learning methods struggle to extract representative features for classification due to the limited data, while existing few-shot HOI models rely on HOI text labels for classification. Moreover, some query images may display visual similarity to those outside their class, such as similar backgrounds between different HOI classes. This makes learning more challenging, especially with limited samples. Bongard-HOI epitomizes this HOI few-shot problem, making it the benchmark we focus on in this paper. In our proposed method, we introduce novel label-uncertain query augmentation techniques to enhance the diversity of the query inputs, aiming to distinguish the positive HOI class from the negative ones. As these augmented inputs may or may not have the same class label as the original inputs, their class label is unknown. Those belonging to a different class become hard samples due to their visual similarity to the original ones. Additionally, we introduce a novel pseudo-label generation technique that enables a mean teacher model to learn from the augmented label-uncertain inputs. We propose to augment the negative support set for the student model to enrich the semantic information, fostering diversity that challenges and enhances the student's learning. Experimental results demonstrate that our method sets a new state-of-the-art (SOTA) performance by achieving 68.74% accuracy on the Bongard-HOI benchmark, a significant improvement over the existing SOTA of 66.59%. In our evaluation on HICO-FS, a more general few-shot recognition dataset, our method achieves 73.27% accuracy, outperforming the previous SOTA of 71.20% in the 5-way 5-shot task.

Framework

Overview of our framework: A novel query (bottom-right image) is created by merging a highly representative positive background with a negative HOI foreground, or vice versa. The newly generated query is then fed into both the teacher and student models. The teacher network is an exponential moving average (EMA) of the student. During mean teacher training, the student model processes two kinds of negative support images: augmented versions of the images selected for augmentation, and the remaining images that were not selected. The student learns from the teacher's predictions, which serve as pseudo-labels. Due to space constraints, only 3 of the 6 positive support images and 3 of the 6 negative support images are displayed, along with one original query and one newly generated query. The abbreviation "motor." denotes "motorcycle".
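
The two mechanisms named in the caption, merging a foreground with a different background and keeping the teacher as an EMA of the student, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy grayscale images, the binary foreground mask, the dictionary of scalar weights, the function names `composite_query` and `ema_update`, and the decay value are all assumptions made for illustration; how the foreground is segmented and how a "highly representative" background is chosen is not specified here.

```python
from typing import Dict, List

Image = List[List[float]]  # toy grayscale image: rows of pixel values (assumption)
Mask = List[List[int]]     # 1 marks HOI foreground pixels, 0 marks background (assumption)


def composite_query(foreground_img: Image, background_img: Image, fg_mask: Mask) -> Image:
    """Paste the masked foreground of one image onto the background of another."""
    return [
        [f if m else b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground_img, background_img, fg_mask)
    ]


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float = 0.99) -> Dict[str, float]:
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Hypothetical 2x2 example: negative HOI foreground on a positive background.
negative_img = [[1.0, 1.0], [1.0, 1.0]]
positive_img = [[0.0, 0.0], [0.0, 0.0]]
mask = [[1, 0], [0, 0]]  # only the top-left pixel belongs to the foreground
new_query = composite_query(negative_img, positive_img, mask)

# One EMA step with a single hypothetical scalar weight per model.
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the composite keeps the background of one class and the interaction of another, its class label is not known in advance, which is why such queries are treated as label-uncertain and supervised through the teacher's pseudo-labels rather than a ground-truth label.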

Qualitative Results

Specific Tasks

overview

t-SNE Visualization

overview

Video

Coming soon...

Citation