DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

¹Meta Reality Labs, ²Department of Computer Science, ETH Zurich

*Work done during an internship at Meta Reality Labs

Accepted to SIGGRAPH Asia 2024

Abstract

Generating natural hand-object interactions in 3D is challenging as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. In this paper, we propose a novel method, dubbed DiffH2O, which can synthesize realistic, one or two-handed object interactions from provided text prompts and geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and an text-based manipulation stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the manipulation phase both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses and helps in generating realistic hand-object interactions. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the manipulation phase. For the textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions. Moreover, we demonstrate the practicality of our framework by utilizing a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the manipulation stage.

Examples of Unseen Objects

"The person picks up the large torus with their right hand, inspects it at chest level with both hands, and finally places it on the table using their right hand."

"The person picks up the elephant using their left hand and inspects it by rotating it in every direction."

"The person uses their right hand to pick up the train from the high table, passes it to their left hand, then passes it to another person on the left side using their left hand, and finally puts it back on the table using their right hand."

"The person picks up the mug with their right hand, lifts it to their mouth, drinks from it while holding it with their right hand, and then places it on the table with their right hand."

"The person picks up the medium cylinder with their right hand, passes it to their left hand, then passes it to someone on their left side with their left hand, and finally places it on the table with their right hand."

"The person picks up the apple from a high table using their right hand, passes it to someone on their right at chest level using their right hand, and finally places it on the table using the same hand."

Examples of Seen Objects

"The person lifts the large cube."

"The person looks through the binoculars."

"The person inspects the stanford bunny."

"The person flies the toy airplane."

"The person plays with the game controller."

"The person lifts the large sphere"

"The person uses the stamp."

"The person inspects the small cylinder."

Text Descriptions for GRAB

We provide carefully annotated text descriptions for the GRAB dataset (Taheri et. al, 2020).

"The person picks up the apple from the table, eats the apple by taking several bites, and then places it back on the table using their right hand."

"The person uses their right hand to pick up the banana from the table, then peels it with their left hand, takes some bites, and finally places it back on the table with their right hand."

"The person picks up the binoculars with the right hand and then switches to holding them with both hands, looks around through the binoculars while bending their upper body and then places them back down on the table using the right hand."

"The person picks up the cup from the table using their right hand, pours the liquid inside several times onto the table, and finally places the cup back onto the table using their right hand."

"The person grabs the mouse from the table with their right hand and then uses it by moving it around the table and clicking several times."

"The person picks up the medium pyramid using both hands, with their right hand at the tip and left hand at the bottom, investigates it with their right hand at neck level, and then places it on the table with their right hand."

BibTeX

@inproceedings{christen2024diffh2o, title={DiffH2O: Diffusion-based synthesis of hand-object interactions from textual descriptions}, author={Christen, Sammy and Hampali, Shreyas and Sener, Fadime and Remelli, Edoardo and Hodan, Tomas and Sauser, Eric and Ma, Shugao and Tekin, Bugra}, booktitle={SIGGRAPH Asia 2024 Conference Papers}, year={2024} }

DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions

Abstract

Examples of Unseen Objects

"The person picks up the large torus with their right hand, inspects it at chest level with both hands, and finally places it on the table using their right hand."

"The person picks up the elephant using their left hand and inspects it by rotating it in every direction."

"The person uses their right hand to pick up the train from the high table, passes it to their left hand, then passes it to another person on the left side using their left hand, and finally puts it back on the table using their right hand."

"The person picks up the mug with their right hand, lifts it to their mouth, drinks from it while holding it with their right hand, and then places it on the table with their right hand."

"The person picks up the medium cylinder with their right hand, passes it to their left hand, then passes it to someone on their left side with their left hand, and finally places it on the table with their right hand."

"The person picks up the apple from a high table using their right hand, passes it to someone on their right at chest level using their right hand, and finally places it on the table using the same hand."

Examples of Seen Objects

"The person lifts the large cube."

"The person looks through the binoculars."

"The person inspects the stanford bunny."

"The person flies the toy airplane."

"The person plays with the game controller."

"The person lifts the large sphere"

"The person uses the stamp."

"The person inspects the small cylinder."

Video

Text Descriptions for GRAB

"The person picks up the apple from the table, eats the apple by taking several bites, and then places it back on the table using their right hand."

"The person uses their right hand to pick up the banana from the table, then peels it with their left hand, takes some bites, and finally places it back on the table with their right hand."

"The person picks up the binoculars with the right hand and then switches to holding them with both hands, looks around through the binoculars while bending their upper body and then places them back down on the table using the right hand."

"The person picks up the cup from the table using their right hand, pours the liquid inside several times onto the table, and finally places the cup back onto the table using their right hand."

"The person grabs the mouse from the table with their right hand and then uses it by moving it around the table and clicking several times."

"The person picks up the medium pyramid using both hands, with their right hand at the tip and left hand at the bottom, investigates it with their right hand at neck level, and then places it on the table with their right hand."

Diversity Evaluations

Different Objects (Seen), Same Text Prompt

Different Objects (Unseen), Same Text Prompt

Same Object, Same Text Prompts

Same Object, Different Text Prompts

Comparison to Baseline

IMoS (Ghosh et. al 2023)

Ours

IMoS (Ghosh et. al 2023)

Ours

Generation with Image-Based Pose Estimate

Source Image

Generated Sequences

Source Image

Generated Sequences

BibTeX

Different Objects (Seen),
Same Text Prompt

Different Objects (Unseen),
Same Text Prompt

Same Object,
Same Text Prompts

Same Object,
Different Text Prompts