ConceptPose: Training-Free Zero-Shot Object Pose Estimation
using Concept Vectors


CVPR 2026
Liming Kuang1,2
Yordanka Velikova1,2
Mahdi Saleh1
Jan-Nico Zaech3
Danda Pani Paudel3
Benjamin Busam1,2,4
1Technical University of Munich
2Munich Center for Machine Learning
3INSAIT, Sofia University "St. Kliment Ohridski"
43dwe.ai

[Paper]
[Code]
[BibTeX]

TL;DR: We tag every 3D point with a concept vector — a semantic descriptor obtained from VLM saliency maps. By matching these vectors between an anchor and a query view, we estimate relative object pose without any training or object-specific models.


Abstract

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach enables precise estimation of the 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming existing methods, including those that rely on extensive dataset-specific training, by over 62% in ADD(-S) score.


Method


Overview of ConceptPose. Given an object, an LLM generates semantic concepts describing its visual parts and features. A VLM then produces Grad-CAM saliency maps for each concept across multiple views, which are back-projected into 3D to form a concept vector for each point. By matching these concept vectors between the reference model and the target observation, we establish robust 3D-3D correspondences that enable accurate 6DoF pose estimation via RANSAC.
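The matching-and-registration stage described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each 3D point already carries a K-dimensional concept vector (aggregated VLM saliency per concept), matches points by mutual-nearest-neighbour cosine similarity, and recovers the rigid transform with a standard RANSAC + Kabsch loop. All function names and parameters are illustrative.

```python
# Hypothetical sketch: concept-vector matching + RANSAC rigid registration.
import numpy as np

def match_concepts(feat_a, feat_q):
    """Mutual nearest neighbours under cosine similarity.
    feat_a: (Na, K) anchor concept vectors; feat_q: (Nq, K) query vectors."""
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + 1e-8)
    q = feat_q / (np.linalg.norm(feat_q, axis=1, keepdims=True) + 1e-8)
    sim = a @ q.T                      # (Na, Nq) cosine similarity matrix
    nn_aq = sim.argmax(axis=1)         # best query index for each anchor point
    nn_qa = sim.argmax(axis=0)         # best anchor index for each query point
    idx_a = np.where(nn_qa[nn_aq] == np.arange(len(a)))[0]  # mutual matches only
    return idx_a, nn_aq[idx_a]

def kabsch(P, Q):
    """Least-squares rotation R and translation t with Q ~ P @ R.T + t."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)          # 3x3 cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # reflection guard
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cq - R @ cp

def ransac_pose(P, Q, iters=200, thresh=0.01, rng=np.random.default_rng(0)):
    """RANSAC over minimal 3-point samples; refit on the largest inlier set."""
    best, best_in = (np.eye(3), np.zeros(3)), -1
    for _ in range(iters):
        s = rng.choice(len(P), 3, replace=False)
        R, t = kabsch(P[s], Q[s])
        inl = np.linalg.norm((P @ R.T + t) - Q, axis=1) < thresh
        if inl.sum() > best_in and inl.sum() >= 3:
            best_in, best = inl.sum(), kabsch(P[inl], Q[inl])
    return best
```

On exact correspondences this recovers the ground-truth pose; in practice the inlier threshold and iteration count would be tuned to the noise level of the concept-vector matches.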


Qualitative Results


Qualitative results of ConceptPose across multiple benchmarks. Our method produces accurate pose estimates on diverse objects without any dataset-specific training.


Quantitative Results

Relative pose estimation performance comparison on REAL275, Toyota-Light, YCB-Video, and LINEMOD. We report ADD(-S) and BOP AR scores. The relative (%) improvement over the strongest baseline is shown in the last row. TF indicates training-free methods. Average values shown in parentheses are computed over only REAL275 and Toyota-Light (TYOL), for fair comparison with Any6D and One2Any, which were not evaluated on YCB-Video and LINEMOD.

Method       TF   REAL275          Toyota-Light     YCB-Video        LINEMOD          Average
                  ADD(-S)  BOP AR  ADD(-S)  BOP AR  ADD(-S)  BOP AR  ADD(-S)  BOP AR  ADD(-S)      BOP AR
SIFT              21.6     38.8    16.5     32.4    13.9     19.3    10.8     18.7     15.7         27.3
ObjectMatch       13.4     26.0     5.4      9.8     3.7      6.0    11.1     12.2      8.4         13.5
Oryon             34.9     46.5    22.9     34.1    12.8     19.4    20.4     25.3     22.8         31.3
Horyon            51.6     57.9    25.1     33.0    22.6     28.6    27.6     34.4     31.7         38.5
Any6D             53.5     51.0    32.2     43.3     --       --      --       --     (42.9)       (47.2)
One2Any           41.0     54.9    34.6     42.0     --       --      --       --     (37.8)       (48.5)
Ours              71.5     60.4    55.0     51.6    41.2     32.8    38.6     31.0     51.6 (63.3)  44.0 (56.0)
Δ (%)            +33.6%   +4.3%   +59.0%  +19.2%   +82.3%  +14.7%   +39.9%   -9.9%   +62.8%       +14.3%
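The ADD(-S) scores above follow the standard definitions: ADD averages the distance between corresponding model points transformed by the ground-truth and predicted poses, while ADD-S (used for symmetric objects) averages the distance to the closest transformed point instead. A minimal sketch of both metrics (numpy; function and variable names are illustrative):

```python
# Hypothetical sketch of the standard ADD and ADD-S pose-error metrics.
import numpy as np

def add_metric(pts, R_gt, t_gt, R_pr, t_pr):
    """ADD: mean distance between corresponding transformed model points."""
    p_gt = pts @ R_gt.T + t_gt
    p_pr = pts @ R_pr.T + t_pr
    return np.linalg.norm(p_gt - p_pr, axis=1).mean()

def adds_metric(pts, R_gt, t_gt, R_pr, t_pr):
    """ADD-S: mean distance to the *closest* predicted point (symmetric objects)."""
    p_gt = pts @ R_gt.T + t_gt
    p_pr = pts @ R_pr.T + t_pr
    d = np.linalg.norm(p_gt[:, None, :] - p_pr[None, :, :], axis=2)  # (N, N) pairwise
    return d.min(axis=1).mean()
```

A pose is conventionally counted as correct when its ADD(-S) error falls below 10% of the object diameter; benchmark tables report the resulting accuracy over all test poses.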


Citation

If you find our work useful, please consider citing:

@inproceedings{kuang2026conceptpose,
    title={ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors},
    author={Kuang, Liming and Velikova, Yordanka and Saleh, Mahdi and Zaech, Jan-Nico and Paudel, Danda Pani and Busam, Benjamin},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2026}
}


Template adapted from source.