The emerging research area of visual task planning attempts to learn representations suitable for planning directly from visual inputs, alleviating the need for accurate geometric models. Current methods commonly assume that similar visual observations correspond to similar states in the task planning space. However, sensor observations are often noisy, with several factors of variation that do not alter the underlying state, for example, different lighting conditions, different viewpoints, or irrelevant background objects. These variations result in visually dissimilar images that correspond to the same task state. Achieving robust abstract state representations for real-world tasks is an important research area that the ELPIS lab is focusing on.
Relevant Publications
arXiv
Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring
Learning abstractions directly from data is a core challenge in robotics. Humans naturally operate at an abstract level, reasoning over high-level subgoals while delegating execution to low-level motor skills – an ability that enables efficient problem solving in complex environments. In robotics, abstractions and hierarchical reasoning have long been central to planning, yet they are typically hand-engineered, demanding significant human effort and limiting scalability. Automating the discovery of useful abstractions directly from visual data would make planning frameworks more scalable and more applicable to real-world robotic domains. In this work, we focus on rearrangement tasks where the state is represented with raw images, and propose a method to induce discrete, graph-structured abstractions by combining structural constraints with an attention-guided visual distance. Our approach leverages the inherent bipartite structure of rearrangement problems, integrating structural constraints and visual embeddings into a unified framework. This enables the autonomous discovery of abstractions from vision alone, which can subsequently support high-level planning. We evaluate our method on two rearrangement tasks in simulation and show that it consistently identifies meaningful abstractions that facilitate effective planning and outperform existing approaches.
@misc{ajith2025learningdiscreteabstractionsvisual,
  title         = {Learning Discrete Abstractions for Visual Rearrangement Tasks Using Vision-Guided Graph Coloring},
  author        = {Ajith, Abhiroop and Chamzas, Constantinos},
  year          = {2025},
  eprint        = {2509.14460},
  archiveprefix = {arXiv},
  primaryclass  = {cs.RO},
  url           = {https://arxiv.org/abs/2509.14460}
}
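The core idea of combining structural conflict constraints with a visual distance can be illustrated with a toy greedy coloring routine. This is a hypothetical sketch, not the paper's actual algorithm or API: observations that are known to belong to different task states share a "conflict" edge and must receive different colors (abstract states), while an assumed `visual_dist` function encourages visually similar observations to share a color.

```python
# Illustrative sketch of vision-guided graph coloring (names are
# hypothetical, not the paper's implementation).

def color_observations(n_nodes, conflict_edges, visual_dist, threshold=0.5):
    """Assign each observation a discrete abstract state (a color).

    conflict_edges: pairs of observations known to be in different states.
    visual_dist:    callable giving a distance between two observations.
    threshold:      maximum visual distance for reusing an existing color.
    """
    adjacency = {i: set() for i in range(n_nodes)}
    for a, b in conflict_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    colors = {}
    for node in range(n_nodes):
        # Colors already used by conflicting neighbors are forbidden.
        forbidden = {colors[nb] for nb in adjacency[node] if nb in colors}
        # Prefer the color of the visually closest compatible node.
        candidates = sorted(
            (visual_dist(node, other), colors[other])
            for other in colors
            if colors[other] not in forbidden
        )
        if candidates and candidates[0][0] < threshold:
            colors[node] = candidates[0][1]
        else:
            colors[node] = max(colors.values(), default=-1) + 1
    return colors
```

With three observations where 0 and 1 conflict but 2 looks like 0, the routine merges 0 and 2 into one abstract state while keeping 1 separate.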
Learning state representations enables robotic planning directly from raw observations such as images. Most methods learn state representations by utilizing losses based on the reconstruction of the raw observations from a lower-dimensional latent space. The similarity between observations in the space of images is often assumed and used as a proxy for estimating similarity between the underlying states of the system. However, observations commonly contain task-irrelevant factors of variation which are nonetheless important for reconstruction, such as varying lighting and different camera viewpoints. In this work, we define relevant evaluation metrics and perform a thorough study of different loss functions for state representation learning. We show that models exploiting task priors, such as Siamese networks with a simple contrastive loss, outperform reconstruction-based representations in visual task planning.
@inproceedings{chamzas2022-contrastive-visual-task-planning,
  title     = {Comparing Reconstruction- and Contrastive-based Models for Visual Task Planning},
  author    = {Chamzas*, Constantinos and Lippi*, Martina and Welle*, Michael C. and Varava, Anastasiia and Kavraki, Lydia E. and Kragic, Danica},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems},
  month     = oct,
  pages     = {12550--12557},
  doi       = {10.1109/IROS47612.2022.9981533},
  year      = {2022},
  url       = {https://doi.org/10.1109/IROS47612.2022.9981533}
}
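The "simple contrastive loss" referenced above is the standard pairwise objective for Siamese networks: pairs labeled as the same task state are pulled together in the latent space, while pairs from different states are pushed apart up to a margin. The function below is a minimal sketch of that objective, not the paper's code.

```python
def contrastive_loss(dist, same_state, margin=1.0):
    """Pairwise contrastive loss (illustrative sketch).

    dist:       distance between the two embeddings of a Siamese pair.
    same_state: True if both images depict the same underlying task state.
    margin:     dissimilar pairs are only penalized while closer than this.
    """
    if same_state:
        # Similar pairs: penalize any distance at all.
        return dist ** 2
    # Dissimilar pairs: penalize only if closer than the margin.
    return max(0.0, margin - dist) ** 2
```

Note that, unlike a reconstruction loss, this objective never needs to explain task-irrelevant pixels (lighting, viewpoint): it only depends on pairwise distances and the weak same/different-state labels.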
Recently, there has been a wealth of development in motion planning for robotic manipulation: new motion planners are continuously proposed, each with its own unique set of strengths and weaknesses. However, evaluating these new planners is challenging, and researchers often create their own ad-hoc problems for benchmarking, which is time-consuming, prone to bias, and does not directly compare against other state-of-the-art planners. We present MotionBenchMaker, an open-source tool to generate benchmarking datasets for realistic robot manipulation problems. MotionBenchMaker is designed to be an extensible, easy-to-use tool that allows users to both generate datasets and benchmark them by comparing motion planning algorithms. Empirically, we show the benefit of using MotionBenchMaker as a tool to procedurally generate datasets, which helps in the fair evaluation of planners. We also present a suite of over 40 prefabricated datasets, with 5 different commonly used robots in 8 environments, to serve as a common ground for future motion planning research.
@article{chamzas2022-motion-bench-maker,
  title   = {MotionBenchMaker: A Tool to Generate and Benchmark Motion Planning Datasets},
  author  = {Chamzas, Constantinos and Quintero-Pe{\~n}a, Carlos and Kingston, Zachary and Orthey, Andreas and Rakita, Daniel and Gleicher, Michael and Toussaint, Marc and Kavraki, Lydia E.},
  journal = {IEEE Robotics and Automation Letters},
  volume  = {7},
  number  = {2},
  pages   = {882--889},
  issn    = {2377-3766},
  doi     = {10.1109/LRA.2021.3133603},
  year    = {2022},
  month   = apr,
  url     = {https://dx.doi.org/10.1109/LRA.2021.3133603}
}
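Procedural dataset generation of the kind MotionBenchMaker performs can be pictured with a small sketch: start from a nominal scene and perturb object poses to produce a family of related planning problems. The function and scene format below are purely illustrative assumptions, not MotionBenchMaker's actual API.

```python
import random

def sample_problems(base_scene, n_variations, pose_noise=0.05, seed=0):
    """Generate perturbed variants of a nominal scene (illustrative sketch).

    base_scene: mapping from object name to its (x, y, z) position.
    pose_noise: uniform perturbation applied to each coordinate.
    Returns a list of scenes, each a slightly different planning problem.
    """
    rng = random.Random(seed)  # seeded for reproducible datasets
    problems = []
    for _ in range(n_variations):
        scene = {
            name: tuple(c + rng.uniform(-pose_noise, pose_noise) for c in pose)
            for name, pose in base_scene.items()
        }
        problems.append(scene)
    return problems
```

Generating many structured variations of one scene, rather than hand-placing each problem, is what enables fair, low-bias comparisons across planners.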
Representation learning allows planning actions directly from raw observations. Variational Autoencoders (VAEs) and their modifications are often used to learn latent state representations from high-dimensional observations such as images of the scene. This approach uses the similarity between observations in the space of images as a proxy for estimating similarity between the underlying states of the system. We argue that, despite some successful implementations, this approach is not applicable in the general case where observations contain task-irrelevant factors of variation. We compare different methods to learn latent representations for a box stacking task and show that models with weak supervision such as Siamese networks with a simple contrastive loss produce more useful representations than traditionally used autoencoders for the final downstream manipulation task.
@misc{chamzas2020rep-learning,
  title     = {State Representations in Robotics: Identifying Relevant Factors of Variation using Weak Supervision},
  author    = {Chamzas*, Constantinos and Lippi*, Martina and Welle*, Michael C. and Varava, Anastasiia and Marino, Alessandro and Kavraki, Lydia E. and Kragic, Danica},
  booktitle = {NeurIPS, 3rd Robot Learning Workshop: Grounding Machine Learning Development in the Real World},
  year      = {2020},
  month     = dec,
  url       = {https://www.robot-learning.ml/2020/}
}