InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

1Max Planck Institute for Intelligent Systems, Tübingen, Germany, 2University of Amsterdam, the Netherlands, 3Inria, École normale supérieure, France
Teaser

We present InteractVLM, a novel method for estimating contact points (shown as red patches) on both human bodies and objects from a single in-the-wild image. Our method goes beyond traditional binary contact estimation by estimating contact points on a human in relation to a specified object. We do so by leveraging the broad visual knowledge of a large Visual-Language Model (VLM).

Abstract

Estimating the 3D pose and shape of interacting humans and objects from single in-the-wild images is important for mixed reality and robotics. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing work tackles these challenges by exploiting surface contact points on the body and object and using these to guide 3D reconstruction. Unfortunately, obtaining 3D contact annotations requires either expensive 3D ground truth or time-consuming manual labeling. Consequently, obtaining training data at scale is a challenge. We tackle this by developing a novel model called InteractVLM that harnesses the broad visual knowledge of large Visual-Language Models (VLMs). The problem is, however, that these large models do not directly “understand” 3D human-object contact. To address this, we exploit existing small datasets of 3D human-object interaction to fine-tune large models to understand contact. However, this is non-trivial, as such models reason “only” in 2D, while contact is inherently 3D. Thus, we introduce a novel “Render-Localize-Lift” module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. This lets InteractVLM infer 3D contacts for both bodies and objects from a single in-the-wild image. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image. To estimate 3D human and object pose, we infer initial body and object meshes, then infer contacts on both of these via InteractVLM, and lastly exploit these in fitting human and object meshes to image evidence. Results show that our approach performs promisingly in the wild. Our code and models will be released.
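To make the "Render-Localize-Lift" idea more concrete, the sketch below illustrates one way the lifting step could aggregate per-view 2D contact predictions back onto 3D mesh vertices. This is an illustrative sketch, not our implementation: the `project_vertices` helper, the camera format, and the voting scheme are hypothetical placeholders, and NumPy is used only for clarity.

```python
# Illustrative sketch (not the paper's implementation) of the "Lift" step in
# Render-Localize-Lift: given per-view 2D contact maps predicted by a
# multi-view localization model, aggregate them onto 3D mesh vertices.
# `project_vertices` and the camera format are hypothetical placeholders.
import numpy as np

def lift_contacts_to_3d(vertices, view_masks, view_cameras, project_vertices,
                        threshold=0.5):
    """Mark a vertex as in-contact if, averaged over the views in which it is
    visible, the predicted 2D contact probability exceeds `threshold`.

    vertices:     (V, 3) mesh vertices.
    view_masks:   list of (H, W) contact-probability maps, one per rendered view.
    view_cameras: list of per-view camera parameters (hypothetical format).
    project_vertices(vertices, camera) -> (pixels (V, 2) int, visible (V,) bool)
    """
    votes = np.zeros(len(vertices))
    counts = np.zeros(len(vertices))
    for mask, cam in zip(view_masks, view_cameras):
        pixels, visible = project_vertices(vertices, cam)
        h, w = mask.shape
        px = np.clip(pixels[:, 0], 0, w - 1)
        py = np.clip(pixels[:, 1], 0, h - 1)
        probs = mask[py, px]                  # sample the 2D prediction at each vertex
        votes[visible] += probs[visible]      # only visible vertices receive votes
        counts[visible] += 1
    scores = np.divide(votes, counts, out=np.zeros_like(votes), where=counts > 0)
    return scores > threshold                 # (V,) boolean per-vertex contact
```

Averaging over only the views in which a vertex is visible is one simple way to handle self-occlusion when fusing multi-view predictions; other aggregation rules are equally plausible.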

Method


(a) Given a single in-the-wild color image, our InteractVLM method estimates 3D contact points on both humans and objects. (b) We then reconstruct the 3D human and object in interaction by exploiting these contacts.

HOI Reconstruction

Results

We build an optimization method that fits a SMPL-X body and an OpenShape-retrieved object mesh to an in-the-wild image, with the reconstruction guided by InteractVLM-inferred contacts. We evaluate against the SotA method PHOSA (ECCV 2020).
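As a rough illustration of how inferred contacts can guide such fitting, the sketch below shows a symmetric Chamfer-style contact term that pulls the predicted human and object contact regions together; in practice it would be combined with reprojection, pose-prior, and interpenetration terms. The loss form, names, and weights here are assumptions for illustration, not necessarily the exact objective used in our optimization.

```python
# Minimal sketch, under assumptions, of a contact term for joint
# human-object fitting: pull the two predicted contact regions together.
import torch

def contact_loss(human_verts, object_verts, human_contact_idx, object_contact_idx):
    """Symmetric Chamfer-style distance between predicted contact regions.

    human_verts:   (Vh, 3) posed SMPL-X vertices (optimized, requires_grad).
    object_verts:  (Vo, 3) transformed object-mesh vertices (optimized).
    *_contact_idx: long tensors of vertex indices predicted to be in contact.
    """
    h = human_verts[human_contact_idx]        # (Nh, 3) human contact vertices
    o = object_verts[object_contact_idx]      # (No, 3) object contact vertices
    d = torch.cdist(h, o)                     # (Nh, No) pairwise distances
    # Each human contact vertex attracts its nearest object contact vertex,
    # and vice versa, so the two contact patches are drawn into alignment.
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```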