Finding NeMO 🐠
A Geometry-Aware Representation of Template Views for Few-Shot Perception


Sebastian Jung, Leonard Klüpfel, Rudolph Triebel, Maximilian Durner
Teaser image

Abstract

We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment, and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information into a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach.

Neural Memory Objects

Our novel encoder combines the information stored in a set of images into a unified, geometrically understandable representation.

Architecture image

This combined representation keeps inference efficient: inference speed is decoupled from the number of encoded template views.

Number of Template Images vs Average Precision and Memory Consumption.

Surface and Pose Estimation

Using the decoder's object-centric pointmap prediction, we can retrieve the camera pose relative to the object as well as the object's visible 3D surface.
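To illustrate how a pose can follow from an object-centric pointmap, here is a minimal sketch of rigid alignment with the Kabsch algorithm. It is not necessarily how NeMO recovers the pose. It assumes that, for each pixel with a predicted object-frame 3D point, a corresponding 3D point in the camera frame is also available (for example from a depth sensor), which this page does not state. The function name `estimate_rigid_transform` and the optional `weights` argument (which could be filled with the pointmap confidence) are hypothetical.

```python
import numpy as np


def estimate_rigid_transform(obj_pts, cam_pts, weights=None):
    """Weighted Kabsch alignment (illustrative, not the authors' method).

    Returns R (3x3) and t (3,) such that cam_pts ≈ obj_pts @ R.T + t.
    obj_pts: (N, 3) points in the object frame (e.g. pointmap values).
    cam_pts: (N, 3) corresponding points in the camera frame (assumed given).
    weights: optional (N,) non-negative weights, e.g. a confidence map.
    """
    obj_pts = np.asarray(obj_pts, dtype=np.float64)
    cam_pts = np.asarray(cam_pts, dtype=np.float64)
    if weights is None:
        weights = np.ones(len(obj_pts))
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()

    # Weighted centroids of both point sets.
    mu_obj = w @ obj_pts
    mu_cam = w @ cam_pts

    # Weighted 3x3 cross-covariance of the centered points.
    X = obj_pts - mu_obj
    Y = cam_pts - mu_cam
    H = (X * w[:, None]).T @ Y

    # The SVD gives the optimal rotation; the sign correction avoids reflections.
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_cam - R @ mu_obj
    return R, t
```

The returned `R` and `t` map object coordinates into the camera frame. The camera pose relative to the object is the inverse transform, with rotation `R.T` and translation `-R.T @ t`.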

Segmentation

The decoder predicts modal and amodal segmentations of objects never seen during training.
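A modal mask covers only the visible pixels of an object, while an amodal mask covers its full extent, including occluded parts. As a small illustration of how the two relate, the sketch below computes the visible fraction of an object from a pair of binary masks. The function name `visible_fraction` is hypothetical and the masks are assumed to be already thresholded; this is not part of the NeMO pipeline described here.

```python
import numpy as np


def visible_fraction(modal_mask, amodal_mask):
    """Fraction of the amodal (full) object area that is visible.

    Both inputs are (H, W) arrays interpreted as booleans.
    Returns 0.0 when the amodal mask is empty.
    """
    modal = np.asarray(modal_mask, dtype=bool)
    amodal = np.asarray(amodal_mask, dtype=bool)
    amodal_area = amodal.sum()
    if amodal_area == 0:
        return 0.0
    # Count only visible pixels that also lie inside the amodal mask.
    return float((modal & amodal).sum() / amodal_area)
```

A value near 1 means the object is almost fully visible, while a low value indicates heavy occlusion.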

The videos are shown in real time. From left to right: camera stream with 6D pose visualization overlay, predicted pointmap, pointmap confidence, modal segmentation mask, amodal segmentation mask.

BibTeX

@inproceedings{jung2026,
  title={Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception},
  author={Sebastian Jung and Leonard Klüpfel and Rudolph Triebel and Maximilian Durner},
  booktitle={3DV},
  year={2026}
}