ContactArt: Learning 3D Interaction Priors for Category-level Articulated Object and Hand Poses Estimation

3DV 2024 (Oral)

¹University of Texas at Austin, ²Carnegie Mellon University, ³UC San Diego, ⁴Google Research

Our method estimates articulated object and hand poses on real-world data by learning hand-object interaction priors from our proposed ContactArt dataset.

We visualize more ContactArt data on our explorer.

Abstract

We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation. We first collect a dataset using visual teleoperation, where a human operator directly manipulates articulated objects inside a physical simulator. We record the data and obtain free, accurate annotations of object poses and contact information from the simulator. Our system requires only an iPhone to record human hand motion, so it scales easily and greatly lowers the cost of data and annotation collection. With this data, we learn 3D interaction priors, including a discriminator (in a GAN) that captures the distribution of how object parts are arranged, and a diffusion model that generates contact regions on articulated objects to guide hand pose estimation. Such structural and contact priors transfer easily to real-world data with barely any domain gap. By using our data and learned priors, our method significantly improves joint hand and articulated object pose estimation over existing state-of-the-art methods.

Video

Hand and articulated object pose estimation on the HOI4D dataset

We present a new hand-object interaction dataset named ContactArt and a learning pipeline for hand and articulated object pose estimation. We train our method on the proposed ContactArt dataset and test it on the real-world HOI4D dataset. Our method learns 3D interaction priors from this large-scale dataset. The priors transfer easily to real-world data and generalize to novel objects.

Articulated object pose estimation on the RBO and BMVC datasets

Our method can also estimate accurate object poses on other real-world datasets, such as the RBO and BMVC datasets. Some examples are shown below.

Test-time adaptation using the discriminator

Our method learns the articulation prior with a discriminator during training. At test time, we feed the predicted articulation structures to the discriminator, whose parameters are kept fixed. We then compute the adversarial loss and back-propagate its gradients to update the estimator. Iterating this process improves the estimator, which then predicts more accurate articulated object poses.
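
The loop described above can be illustrated with a deliberately tiny, standard-library-only Python sketch. Everything in it is an assumption made for illustration and not the networks used in the paper: the estimator is a hypothetical elementwise scaling of input features, the discriminator is a hypothetical frozen logistic scorer, and the adversarial loss is taken to be -log D(pose) with its gradient written out by hand. It only shows the control flow: the discriminator parameters stay fixed while the estimator parameters are updated.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def estimator(features, theta):
    """Toy estimator (hypothetical): scales each feature into a pose value."""
    return [t * f for t, f in zip(theta, features)]


def discriminator(pose, w, b):
    """Toy frozen discriminator (hypothetical): score in (0, 1), higher = more plausible."""
    return sigmoid(sum(wi * pi for wi, pi in zip(w, pose)) + b)


def adversarial_loss(features, theta, w, b):
    """Assumed generator-style loss: -log D(estimator(features))."""
    return -math.log(discriminator(estimator(features, theta), w, b))


def adapt_at_test_time(features, theta, w, b, lr=0.1, steps=50):
    """Update only the estimator parameters `theta`; `w` and `b` are never changed."""
    theta = list(theta)
    for _ in range(steps):
        d = discriminator(estimator(features, theta), w, b)
        # Chain rule for loss = -log(sigmoid(z)), z = sum_i w_i * theta_i * f_i + b:
        # dloss/dz = -(1 - d), dz/dtheta_i = w_i * f_i
        grad = [-(1.0 - d) * wi * fi for wi, fi in zip(w, features)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Illustrative values only.
features = [1.0, -2.0, 0.5]
theta0 = [0.5, 0.5, 0.5]
w_frozen, b_frozen = [1.0, -1.0, 2.0], -1.0

theta_adapted = adapt_at_test_time(features, theta0, w_frozen, b_frozen)
print("loss before:", adversarial_loss(features, theta0, w_frozen, b_frozen))
print("loss after: ", adversarial_loss(features, theta_adapted, w_frozen, b_frozen))
```

In a real pipeline the hand-written gradient would typically be replaced by automatic differentiation, with the discriminator weights excluded from the optimizer.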

Optimizing the hand pose using the contact diffusion model

Our method learns the contact prior with a contact diffusion model. At test time, we use this model to predict the contact regions on the object. We then optimize the hand pose by minimizing the distance between the hand vertices and the contact points in the predicted contact regions. This improves hand pose estimation.
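
The optimization step above can be sketched in a few lines of standard-library Python. This is a simplified stand-in under stated assumptions, not the paper's implementation: the contact points are given directly rather than sampled from a diffusion model, the hand "pose" is reduced to a single global translation `t` (a full hand model would have many more parameters), and the distance is assumed to be the mean squared distance from each contact point to its nearest hand vertex. All function names are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def translate(vertices, t):
    return [tuple(a + s for a, s in zip(v, t)) for v in vertices]


def contact_loss(hand_vertices, contact_points, t):
    """Mean squared distance from each contact point to its nearest translated hand vertex."""
    moved = translate(hand_vertices, t)
    total = sum(min(sq_dist(v, c) for v in moved) for c in contact_points)
    return total / len(contact_points)


def optimize_translation(hand_vertices, contact_points, lr=0.1, steps=100):
    """Gradient descent on the hand translation `t` (the only free parameter here)."""
    t = [0.0, 0.0, 0.0]
    n = len(contact_points)
    for _ in range(steps):
        moved = translate(hand_vertices, t)
        grad = [0.0, 0.0, 0.0]
        for c in contact_points:
            # Nearest hand vertex under the current translation.
            v = min(moved, key=lambda u: sq_dist(u, c))
            # Gradient of ||v + t - c||^2 w.r.t. t (v already includes t).
            for i in range(3):
                grad[i] += 2.0 * (v[i] - c[i]) / n
        t = [ti - lr * gi for ti, gi in zip(t, grad)]
    return t


# Illustrative values only: contact points are the hand vertices shifted by a known offset.
hand = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
offset = (0.3, -0.2, 0.5)
contacts = translate(hand, offset)

t_opt = optimize_translation(hand, contacts)
print("recovered translation:", t_opt)
print("loss before:", contact_loss(hand, contacts, [0.0, 0.0, 0.0]))
print("loss after: ", contact_loss(hand, contacts, t_opt))
```

The nearest-vertex assignment is recomputed at every step, so the correspondence between hand vertices and contact points can change as the hand moves toward the predicted contact regions.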