TL;DR: We tag every 3D point with a concept vector — a semantic descriptor obtained from VLM saliency maps. By matching these vectors between an anchor and a query view, we estimate the relative object pose without any training or object-specific models.

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, in which each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach enables precise estimation of the 6DoF relative pose. Without any object- or dataset-specific training, it achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming existing methods, including those that rely on extensive dataset-specific training, by over 62% in ADD(-S) score.

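To make the idea of a 3D concept map concrete, the sketch below back-projects a depth map into camera-frame 3D points and tags each valid point with a vector of per-concept saliency responses for a single view. This is a minimal illustration under assumed pinhole intrinsics `K`; the function names, the single-view setting, and the use of raw saliency values as concept-vector entries are our own simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map of shape (H, W) to 3D points in the camera frame.

    Assumes a pinhole camera with 3x3 intrinsics K (illustrative).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)        # (H, W, 3)

def concept_map(depth, K, saliency_maps):
    """Tag every valid 3D point with a concept vector.

    saliency_maps: list of (H, W) per-concept saliency responses,
    e.g. GradCAM maps, one per LLM-generated concept (hypothetical input).
    Returns (N, 3) points and (N, C) concept vectors.
    """
    pts = backproject(depth, K)
    valid = depth > 0                               # drop missing depth
    feats = np.stack([s[valid] for s in saliency_maps], axis=-1)
    return pts[valid], feats
```

In the multi-view setting described above, per-view vectors for the same 3D point would additionally be aggregated across views; the sketch omits that step.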
Overview of ConceptPose. Given an object, an LLM generates semantic concepts describing its visual parts and features. A VLM then produces GradCAM saliency maps for each concept across multiple views, which are back-projected into 3D to form concept vectors for each point. By matching these concept vectors between the reference model and the target observation, we establish robust 3D-3D correspondences that enable accurate 6DoF pose estimation via RANSAC.

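The matching and RANSAC steps above can be sketched with a generic recipe: nearest-neighbour matching of concept vectors under cosine similarity, followed by a rigid (Kabsch) fit inside a RANSAC loop. All names, thresholds, and iteration counts below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def match_concepts(pts_a, feats_a, pts_b, feats_b):
    # Nearest-neighbour matches under cosine similarity of concept vectors.
    fa = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    fb = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    nn = (fa @ fb.T).argmax(axis=1)
    return pts_a, pts_b[nn]                 # matched 3D-3D pairs

def kabsch(P, Q):
    # Least-squares rigid transform (R, t) mapping points P onto Q.
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def ransac_pose(P, Q, iters=500, thresh=0.01, seed=0):
    # Minimal-sample (3-point) RANSAC over candidate rigid fits.
    rng = np.random.default_rng(seed)
    best_R, best_t, best_in = np.eye(3), np.zeros(3), -1
    for _ in range(iters):
        idx = rng.choice(len(P), 3, replace=False)
        R, t = kabsch(P[idx], Q[idx])
        inliers = np.linalg.norm((P @ R.T + t) - Q, axis=1) < thresh
        if inliers.sum() > best_in:
            best_in = inliers.sum()
            best_R, best_t = kabsch(P[inliers], Q[inliers])  # refit on inliers
    return best_R, best_t
```

The reflection guard in `kabsch` matters in practice: without it, near-planar minimal samples can yield improper rotations that silently corrupt the RANSAC scoring.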
Qualitative results of ConceptPose across multiple benchmarks. Our method produces accurate pose estimates on diverse objects without any dataset-specific training.

Relative pose estimation performance comparison on REAL275, Toyota-Light (TYOL), YCB-Video (YCB-V), and LINEMOD (LM). We report ADD(-S) and BOP Average Recall (AR) scores; the (%) improvement over the strongest baseline is shown in the last row. TF indicates training-free methods. Average columns marked with † are computed over only REAL275 and TYOL for fair comparison with Any6D and One2Any, which were not evaluated on YCB-Video and LINEMOD.

| Method | TF | REAL275 ADD(-S) | REAL275 AR | TYOL ADD(-S) | TYOL AR | YCB-V ADD(-S) | YCB-V AR | LM ADD(-S) | LM AR | Avg. ADD(-S) | Avg. AR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SIFT | ✗ | 21.6 | 38.8 | 16.5 | 32.4 | 13.9 | 19.3 | 10.8 | 18.7 | 15.7 | 27.3 |
| ObjectMatch | ✗ | 13.4 | 26.0 | 5.4 | 9.8 | 3.7 | 6.0 | 11.1 | 12.2 | 8.4 | 13.5 |
| Oryon | ✗ | 34.9 | 46.5 | 22.9 | 34.1 | 12.8 | 19.4 | 20.4 | 25.3 | 22.8 | 31.3 |
| Horyon | ✗ | 51.6 | 57.9 | 25.1 | 33.0 | 22.6 | 28.6 | 27.6 | 34.4 | 31.7 | 38.5 |
| Any6D | ✗ | 53.5 | 51.0 | 32.2 | 43.3 | -- | -- | -- | -- | (42.9†) | (47.2†) |
| One2Any | ✗ | 41.0 | 54.9 | 34.6 | 42.0 | -- | -- | -- | -- | (37.8†) | (48.5†) |
| Ours | ✓ | 71.5 | 60.4 | 55.0 | 51.6 | 41.2 | 32.8 | 38.6 | 31.0 | 51.6 (63.3†) | 44.0 (56.0†) |
| Δ(%) | -- | +33.6% | +4.3% | +59.0% | +19.2% | +82.3% | +14.7% | +39.9% | -9.9% | +62.8% | +14.3% |

If you find our work useful, please consider citing:

    @inproceedings{kuang2026conceptpose,
      title={ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors},
      author={Kuang, Liming and Velikova, Yordanka and Saleh, Mahdi and Zaech, Jan-Nico and Paudel, Danda Pani and Busam, Benjamin},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026}
    }