Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes.
In this paper, we introduce LoRD-HOI (Low-Rank Decomposed VLM Feature Adaptation for Zero-Shot HOI Detection), a novel approach that both enhances generalization to unseen classes and improves action distinction. During training, LoRD-HOI decomposes the VLM text features of the given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. Together, these basis features and weights form a compact HOI representation that preserves information shared across classes, enhancing generalization to unseen classes. We then refine action distinction by adapting the weights of each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization.
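To make the factorization step concrete, below is a minimal sketch of one way such a low-rank decomposition could be implemented with a truncated SVD. The function name, the `rank` argument, and the use of PyTorch are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def decompose_text_features(text_feats: torch.Tensor, rank: int):
    """Factorize HOI text features (num_classes x dim) into class-shared
    basis features B (rank x dim) and per-class weights W (num_classes x rank),
    such that W @ B approximately reconstructs the original features."""
    # Truncated SVD: text_feats ≈ U[:, :rank] @ diag(S[:rank]) @ Vh[:rank]
    U, S, Vh = torch.linalg.svd(text_feats, full_matrices=False)
    B = Vh[:rank]                 # class-shared basis features
    W = U[:, :rank] * S[:rank]    # adaptable per-class weights
    return B, W
```

Under this view, an unseen HOI class is expressed over the same shared basis B, with only its weight row needing to be formed or adapted.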
Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting.
Overview of our proposed pipeline.
In the language branch, VLM text HOI features are decomposed into HOI basis features \( {\mathrm{B}}\) and HOI weights \( {\mathrm{W}}\). Similarly, action features are decomposed into action basis features \( {\mathrm{B}^a}\) and action weights \( {\mathrm{W}^a}\), where \( {\mathrm{B}^a}\) is selected from \( {\mathrm{B}}\). The weight adaptation updates \( {\mathrm{W}}\) under LLM-derived action regularization, using \( {\mathrm{W}^a}\), which encodes LLM-generated action information. The text fusion module combines action and object text features. In the vision branch, we adapt the VLM visual encoder with human-object tokens, then crop human, object, and HOI union regions from the encoder output ("H, O, U" in the figure). The image fusion module combines human and object features. The final prediction combines the outputs of the vision and language branches.
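As a rough illustration of the final prediction step, the sketch below scores each human-object pair against every HOI class by comparing its visual interaction feature with the text feature reconstructed from the adapted weights and basis. The cosine-similarity formulation, the temperature, and the variable names are assumptions for illustration, not the exact scoring used in the paper.

```python
import torch
import torch.nn.functional as F

def predict_hoi_logits(union_feats: torch.Tensor,  # (num_pairs, dim) visual HOI features
                       W: torch.Tensor,            # (num_classes, rank) adapted HOI weights
                       B: torch.Tensor,            # (rank, dim) HOI basis features
                       temperature: float = 0.07):
    """Match visual interaction features against reconstructed HOI text features."""
    text_feats = F.normalize(W @ B, dim=-1)          # reconstructed, L2-normalized class embeddings
    union_feats = F.normalize(union_feats, dim=-1)
    return union_feats @ text_feats.T / temperature  # cosine-similarity logits per HOI class
```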
Here are some qualitative results.
Here are key experimental results in zero-shot HOI detection.
@inproceedings{lei2025lordhoi,
  title     = {LoRD-HOI: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation},
  author    = {Lei, Qinqian and Wang, Bo and Tan, Robby T.},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2025}
}