TL;DR: We tag every 3D point with a concept vector — a semantic descriptor obtained from VLM saliency maps. By matching these vectors between an anchor and a query view, we estimate the relative object pose without any training or object-specific models.

Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, in which each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach enables precise estimation of the 6DoF relative pose. Without any object- or dataset-specific training, it achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming existing methods, including those that rely on extensive dataset-specific training, by over 62% in ADD(-S) score.

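To make the idea of a 3D concept map concrete, the sketch below back-projects a depth map into camera-frame 3D points and tags each valid point with a vector of per-concept saliency responses for a single view. This is a minimal illustration under assumed pinhole intrinsics `K`; the function names, the single-view setting, and the use of raw saliency values as concept-vector entries are our own simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map of shape (H, W) to 3D points in the camera frame.

    Assumes a pinhole camera with 3x3 intrinsics K (illustrative).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)        # (H, W, 3)

def concept_map(depth, K, saliency_maps):
    """Tag every valid 3D point with a concept vector.

    saliency_maps: list of (H, W) per-concept saliency responses,
    e.g. GradCAM maps, one per LLM-generated concept (hypothetical input).
    Returns (N, 3) points and (N, C) concept vectors.
    """
    pts = backproject(depth, K)
    valid = depth > 0                               # drop missing depth
    feats = np.stack([s[valid] for s in saliency_maps], axis=-1)
    return pts[valid], feats
```

In the multi-view setting described above, per-view vectors for the same 3D point would additionally be aggregated across views; the sketch omits that step.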
Overview of ConceptPose. Given an object, an LLM generates semantic concepts describing its visual parts and features. A VLM then produces GradCAM saliency maps for each concept across multiple views, which are back-projected into 3D to form concept vectors for each point. By matching these concept vectors between the reference model and the target observation, we establish robust 3D-3D correspondences that enable accurate 6DoF pose estimation via RANSAC.

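The matching and RANSAC steps above can be sketched with a generic recipe: nearest-neighbour matching of concept vectors under cosine similarity, followed by a rigid (Kabsch) fit inside a RANSAC loop. All names, thresholds, and iteration counts below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def match_concepts(pts_a, feats_a, pts_b, feats_b):
    # Nearest-neighbour matches under cosine similarity of concept vectors.
    fa = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    fb = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    nn = (fa @ fb.T).argmax(axis=1)
    return pts_a, pts_b[nn]                 # matched 3D-3D pairs

def kabsch(P, Q):
    # Least-squares rigid transform (R, t) mapping points P onto Q.
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def ransac_pose(P, Q, iters=500, thresh=0.01, seed=0):
    # Minimal-sample (3-point) RANSAC over candidate rigid fits.
    rng = np.random.default_rng(seed)
    best_R, best_t, best_in = np.eye(3), np.zeros(3), -1
    for _ in range(iters):
        idx = rng.choice(len(P), 3, replace=False)
        R, t = kabsch(P[idx], Q[idx])
        inliers = np.linalg.norm((P @ R.T + t) - Q, axis=1) < thresh
        if inliers.sum() > best_in:
            best_in = inliers.sum()
            best_R, best_t = kabsch(P[inliers], Q[inliers])  # refit on inliers
    return best_R, best_t
```

The reflection guard in `kabsch` matters in practice: without it, near-planar minimal samples can yield improper rotations that silently corrupt the RANSAC scoring.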
Qualitative results of ConceptPose across multiple benchmarks. Our method produces accurate pose estimates on diverse objects without any dataset-specific training.

Relative pose estimation performance comparison on REAL275, Toyota-Light (TYOL), YCB-Video (YCB-V), and LINEMOD (LM). We report ADD(-S) and BOP Average Recall (AR) scores; the (%) improvement over the strongest baseline is shown in the last row. TF indicates training-free methods. Average columns marked with † are computed over only REAL275 and TYOL for fair comparison with Any6D and One2Any, which were not evaluated on YCB-Video and LINEMOD.

| Method | TF | REAL275 ADD(-S) | REAL275 AR | TYOL ADD(-S) | TYOL AR | YCB-V ADD(-S) | YCB-V AR | LM ADD(-S) | LM AR | Avg. ADD(-S) | Avg. AR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SIFT | ✗ | 21.6 | 38.8 | 16.5 | 32.4 | 13.9 | 19.3 | 10.8 | 18.7 | 15.7 | 27.3 |
| ObjectMatch | ✗ | 13.4 | 26.0 | 5.4 | 9.8 | 3.7 | 6.0 | 11.1 | 12.2 | 8.4 | 13.5 |
| Oryon | ✗ | 34.9 | 46.5 | 22.9 | 34.1 | 12.8 | 19.4 | 20.4 | 25.3 | 22.8 | 31.3 |
| Horyon | ✗ | 51.6 | 57.9 | 25.1 | 33.0 | 22.6 | 28.6 | 27.6 | 34.4 | 31.7 | 38.5 |
| Any6D | ✗ | 53.5 | 51.0 | 32.2 | 43.3 | -- | -- | -- | -- | (42.9†) | (47.2†) |
| One2Any | ✗ | 41.0 | 54.9 | 34.6 | 42.0 | -- | -- | -- | -- | (37.8†) | (48.5†) |
| Ours | ✓ | 71.5 | 60.4 | 55.0 | 51.6 | 41.2 | 32.8 | 38.6 | 31.0 | 51.6 (63.3†) | 44.0 (56.0†) |
| Δ(%) | -- | +33.6% | +4.3% | +59.0% | +19.2% | +82.3% | +14.7% | +39.9% | -9.9% | +62.8% | +14.3% |

If you find our work useful, please consider citing:

    @inproceedings{kuang2026conceptpose,
      title={ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors},
      author={Kuang, Liming and Velikova, Yordanka and Saleh, Mahdi and Zaech, Jan-Nico and Paudel, Danda Pani and Busam, Benjamin},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026}
    }