CLIPGraphs

CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities

¹Robotics Research Center, IIIT Hyderabad, India
²TCS Research, Tata Consultancy Services, India
³MIT CSAIL
⁴Intelligent Robotics Lab, University of Birmingham, UK
*Equal Contribution

Abstract

This paper introduces a novel method for determining the best room in which to place an object, for embodied scene rearrangement. While state-of-the-art approaches rely on large language models (LLMs) or policies trained with reinforcement learning (RL) for this task, our approach, CLIPGraphs, efficiently combines commonsense domain knowledge, data-driven methods, and recent advances in multimodal learning. Specifically, it (a) encodes a knowledge graph of prior human preferences about the room location of different objects in home environments, (b) incorporates vision-language features to support multimodal queries based on images or text, and (c) uses a graph network to learn object-room affinities based on embeddings of the prior knowledge and the vision-language features. We demonstrate that our approach provides better estimates of the most appropriate location of objects from a benchmark set of object categories than state-of-the-art baselines.
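To make the idea of an object-room affinity concrete, the following is a minimal sketch of only the final lookup step: scoring one object embedding against a set of room embeddings with cosine similarity and picking the highest-scoring room. It is not the CLIPGraphs implementation. The function names, the room list, and the three-dimensional toy vectors are all assumptions made for illustration; in the approach described above, the embeddings would come from vision-language features refined by a graph network, which this sketch does not reproduce.

```python
import numpy as np


def cosine_affinities(object_vec, room_mat):
    """Return the cosine similarity between one object embedding and each room embedding.

    object_vec: shape (d,)   -- hypothetical object embedding
    room_mat:   shape (r, d) -- hypothetical room embeddings, one row per room
    """
    obj = object_vec / np.linalg.norm(object_vec)
    rooms = room_mat / np.linalg.norm(room_mat, axis=1, keepdims=True)
    return rooms @ obj


def best_room(object_vec, room_mat, room_names):
    """Pick the room whose embedding has the highest cosine similarity to the object."""
    scores = cosine_affinities(object_vec, room_mat)
    return room_names[int(np.argmax(scores))], scores


# Toy data (assumed for illustration only, not taken from the paper).
ROOM_NAMES = ["kitchen", "bathroom", "bedroom"]
ROOM_EMBEDDINGS = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
frying_pan = np.array([0.9, 0.1, 0.2])  # made-up object embedding

room, scores = best_room(frying_pan, ROOM_EMBEDDINGS, ROOM_NAMES)
print(room)    # kitchen
print(scores)  # one affinity score per room, in ROOM_NAMES order
```

Because the query is just a vector, the same lookup would work whether the object embedding was produced from an image or from a text description, which is the sense in which such a setup can support multimodal queries.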

Dataset

The IRONA dataset contains 30 web-scraped images for each of the 268 object categories used by Housekeep. With over 8000 images spanning 268 common household object categories, it serves as a diverse and reliable dataset for our approach.

The image above shows a small subset of the IRONA dataset.

BibTeX

@misc{agrawal2023clipgraphs,
  title={CLIPGraphs: Multimodal Graph Networks to Infer Object-Room Affinities},
  author={Ayush Agrawal and Raghav Arora and Ahana Datta and Snehasis Banerjee and Brojeshwar Bhowmick and Krishna Murthy Jatavallabhula and Mohan Sridharan and Madhava Krishna},
  year={2023},
  eprint={2306.01540},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}

We introduce CLIPGraphs, a method to leverage the semantic consistency in organization and incorporate it into embodied AI agents.
