DeFM: Learning Foundation Representations from Depth for Robotics

1ETH Zurich, 2Stanford University, 3UC Berkeley, 4NVIDIA
Under Review

DeFM learns robust representations from depth images that transfer zero-shot across navigation, manipulation, and locomotion tasks, achieving state-of-the-art performance on diverse benchmarks.

Abstract

Depth sensors are widely deployed across robotic platforms, and advances in fast, high-fidelity depth simulation have enabled robotic policies trained on depth observations to achieve robust sim-to-real transfer for a wide range of tasks. Despite this, representation learning for the depth modality remains underexplored compared to RGB, where large-scale foundation models now define the state of the art. To address this gap, we present DeFM, a self-supervised foundation model trained entirely on depth images for robotic applications. Using a DINO-style self-distillation objective on a curated dataset of 60M depth images, DeFM learns geometric and semantic representations that generalize to diverse environments, tasks, and sensors. To retain metric awareness across multiple scales, we introduce a novel input normalization strategy. We further distill DeFM into compact models suitable for resource-constrained robotic systems. When evaluated on depth-based classification, segmentation, navigation, locomotion, and manipulation benchmarks, DeFM achieves state-of-the-art performance and demonstrates strong generalization from simulation to real-world environments.

tl;dr

A DINO-style encoder, but for depth image inputs. Works on diverse robotics tasks without task-specific finetuning.

Highlights

First Depth Foundation Model for Robotics

Trained on 60M curated depth images using self-supervision, DeFM learns representations that generalize to diverse environments, tasks, and sensors.

State-of-the-Art Performance

SOTA performance across several tasks, including classification, segmentation, navigation, manipulation, and locomotion, without any task-specific fine-tuning.

Novel Metric-Aware Normalization

A 3-channel log-compressed input representation that preserves metric depth across scales from millimeters to 100 meters.
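To illustrate what a log-compressed, metric-aware input might look like (the exact channel layout and depth range used by DeFM are not specified on this page, so treat both as assumptions), the sketch below maps metric depth onto a bounded log scale and replicates it into three channels for a standard vision backbone:

```python
import math
import torch

# Hypothetical metric-aware normalization: log-compress metric depth so that
# both millimeter- and 100 m-scale structure map into a bounded input range.
# The depth range and 3-channel replication below are illustrative assumptions.
D_MIN, D_MAX = 1e-3, 100.0  # assumed valid depth range in meters

def normalize_depth(depth_m: torch.Tensor) -> torch.Tensor:
    """depth_m: (H, W) metric depth in meters -> (3, H, W) normalized input."""
    valid = depth_m > 0                        # zero depth marks invalid pixels
    d = depth_m.clamp(D_MIN, D_MAX)
    # Log compression preserves relative resolution at near range while still
    # covering far-range scenes; the result lies in [0, 1].
    log_d = torch.log(d / D_MIN) / math.log(D_MAX / D_MIN)
    x = log_d.unsqueeze(0).repeat(3, 1, 1)     # replicate to 3 channels
    return x * valid.unsqueeze(0)

x = normalize_depth(torch.rand(224, 224) * 10.0)
print(x.shape)  # torch.Size([3, 224, 224])
```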

Efficient Distilled Models

Includes 11 model variants ranging from 3M to 307M parameters, covering ViT-S/L backbones and distilled CNNs such as ResNet, EfficientNet, and RegNet.
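As a hedged sketch of how a large ViT teacher could be distilled into a compact CNN student (the actual distillation loss, projection head, and training recipe are not described on this page and are assumptions), the snippet below matches a ResNet-18 student's embedding to a frozen teacher embedding with a cosine loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Illustrative feature distillation: a compact CNN student is trained to
# reproduce a frozen teacher's global embedding. The cosine loss and the
# linear projection head are assumptions, not DeFM's released recipe.
TEACHER_DIM = 1024  # e.g. a ViT-L embedding size

student = torchvision.models.resnet18(weights=None)
student.fc = nn.Linear(student.fc.in_features, TEACHER_DIM)  # project to teacher dim
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distillation_step(teacher: nn.Module, depth_batch: torch.Tensor) -> float:
    with torch.no_grad():                    # teacher stays frozen
        target = teacher(depth_batch)        # (B, TEACHER_DIM)
    pred = student(depth_batch)              # (B, TEACHER_DIM)
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```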

Summary

DeFM overview

DeFM Overview: We present DeFM, a foundation model for depth images. Pretrained with a DINOv2-style self-distillation method (III) on a curated dataset of 60M depth images (II), DeFM achieves state-of-the-art results across several classification and semantic segmentation benchmarks under linear probing (I). PCA of DeFM features reveals semantic awareness despite depth lacking texture and color (V). We distill our largest DeFM model into several efficient CNNs (IV) for use in downstream robotic reinforcement learning tasks, including navigation, manipulation, and locomotion (VI).
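For readers unfamiliar with the pretraining objective, the following is a minimal sketch of a generic DINO-style self-distillation step; the temperatures, centering, and EMA momentum are standard defaults, not DeFM's reported hyperparameters:

```python
import torch
import torch.nn.functional as F

# Minimal DINO-style self-distillation step. Temperatures, centering, and the
# EMA momentum are generic defaults, not DeFM's reported hyperparameters.
def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    # Teacher output is centered and sharpened, then used as a (detached)
    # target distribution for the student via cross-entropy.
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.996):
    # Teacher weights track an exponential moving average of the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```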

Method Overview

Emergent Semantic Understanding

Although depth images lack color and texture, DeFM learns rich semantic representations: PCA visualization reveals consistent feature clustering across objects, sensors, and domains.
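A common way to produce such visualizations (not necessarily the exact script used for the figures here) is to project the patch tokens of a frozen encoder onto their top three principal components and map them to RGB:

```python
import torch

# Project each patch token onto the top-3 principal components and map them
# to RGB. This is the standard recipe for DINO-style feature visualizations;
# any masking or per-image rescaling used for the actual figures is assumed.
def pca_rgb(patch_tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """patch_tokens: (N, C) features of one image with N = h * w patches."""
    x = patch_tokens - patch_tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)            # top-3 principal directions
    proj = x @ v                                   # (N, 3)
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj.reshape(h, w, 3)                   # RGB image in [0, 1]
```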

Navigation: Embodiment-Aware Long-Range Navigation

Navigation Performance

Habitat PointGoal (Gibson Val): 88.4% SPL
Embodiment-Aware (Training Env): 90.3% SR
OOD Environments (Avg): 84.3% SR

Key Insight: DeFM works out-of-the-box without task-specific depth preprocessing, achieving robust sim-to-real transfer and reducing collision failures in novel environments.

Manipulation: Dexterous Grasping

Dexterous Grasping Performance

Success Rate (Fine-tuned): 89.4%
Frozen Encoder Performance: 80.9%
Noise Robustness: 12% drop (vs. 28% for baselines)

Key Insight: DeFM features remain stable under sensor noise variations, with the frozen encoder achieving 80.9% success and only a 12% performance drop under novel Kinect noise.
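As a rough sketch of how a frozen depth encoder can back an RL policy (module names, feature dimension, and head architecture below are hypothetical placeholders), only a small policy head on top of the frozen features receives gradients:

```python
import torch
import torch.nn as nn

# Hypothetical policy wrapper around a frozen depth encoder: only the small
# MLP head receives gradients during RL training. Feature and action
# dimensions are placeholders.
class FrozenEncoderPolicy(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, act_dim: int):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)          # keep pretrained features fixed
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ELU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, depth_obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feat = self.encoder(depth_obs)   # (B, feat_dim) frozen features
        return self.head(feat)
```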

Locomotion: Quadruped Ladder Climbing

Ladder Climbing Performance

Overall Success Rate: 90.1%
Ladder Angle Range: 70°-90°
Rung Radius Range: 15-45 mm

Key Insight: Frozen DeFM matches the scratch-trained baseline (90.45%) with zero task-specific pretraining while demonstrating better noise robustness for sim-to-real transfer.

Dense Prediction: Semantic Segmentation

Segmentation results across 5 datasets

Qualitative Results: DeFM-S/14 significantly outperforms DINOv3-S/16 on diverse segmentation tasks spanning indoor (ScanNet, SUN-RGBD), outdoor (OFFSED, TartanGround), and manipulation (GraspNet-1B) domains.

Segmentation Performance (mIoU)

Model            ScanNet  SUN-RGBD  OFFSED  TartanGround  GraspNet-1B
DeFM ViT-L/14    31.34    31.26     57.62   67.69         27.85
DINOv3 ViT-L/16  28.52    32.74     54.42   62.16         23.89
DeFM ViT-S/14    27.69    27.78     57.35   64.66         19.89
DINOv3 ViT-S/16  20.05    18.42     56.32   56.97         14.87

Key Result: DeFM achieves state-of-the-art results on 4/5 datasets. DeFM-S/14 shows up to 30% mIoU improvement over similar-sized baselines, demonstrating the critical need for efficient, depth-specific foundation models.
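For context, linear-probe segmentation for such backbones is typically a single linear classifier over frozen patch features, bilinearly upsampled to label resolution; the sketch below assumes that common protocol rather than the paper's exact evaluation setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Linear probing for segmentation: a 1x1 linear classifier over frozen patch
# features, bilinearly upsampled to label resolution. Which feature layer and
# input resolution produce the reported numbers is assumed, not stated here.
class LinearSegProbe(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, patch_feats: torch.Tensor, out_size) -> torch.Tensor:
        """patch_feats: (B, C, h, w) frozen backbone features."""
        logits = self.classifier(patch_feats)              # (B, K, h, w)
        return F.interpolate(logits, size=out_size,
                             mode="bilinear", align_corners=False)
```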

Model Zoo: Efficient Variants for Robotics

11 model variants ranging from 3M to 307M parameters, optimized for diverse deployment scenarios.

Vision Transformers

Model     Params  FLOPs   RTX 4090 (ms)  Jetson Orin (ms)  Use Case
ViT-L/14  307M    9962 G  625            72.8              Maximum accuracy, offline processing
ViT-S/14  22M     707 G   64             11.9              Balanced performance & speed

ResNet Family

Model      Params  FLOPs  RTX 4090 (ms)  Jetson Orin (ms)  Use Case
ResNet-50  26M     631 G  69             17.8              Strong features, efficient training
ResNet-34  22M     494 G  33             13.5              Good balance for mid-range hardware
ResNet-18  12M     256 G  21             8.7               Lightweight, edge devices

RegNet Family

Model          Params  Jetson Orin (ms)
RegNetY-1.6GF  12M     41.8
RegNetY-800MF  6M      24.2
RegNetY-400MF  4M      25.2

EfficientNet Family

Model            Params  Jetson Orin (ms)
EfficientNet-B6  29M     54.1
EfficientNet-B4  14M     39.7
EfficientNet-B2  5M      28.4
EfficientNet-B0  3M      21.0

Note: All timings measured with PyTorch (BS=128 for RTX 4090, BS=1 for Jetson Orin, 224×224 input). TensorRT/ONNX optimization can further reduce latency, especially on edge devices.
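A minimal sketch of how such timings can be reproduced with PyTorch (warm-up iterations, CUDA synchronization, fixed batch and input size; the warm-up and iteration counts are illustrative assumptions):

```python
import time
import torch

# Rough latency measurement matching the setup above: fixed batch size,
# 3x224x224 input, warm-up iterations, and CUDA synchronization around the
# timed loop. Warm-up and iteration counts are illustrative.
@torch.no_grad()
def measure_latency_ms(model, batch_size, device="cuda", warmup=20, iters=100):
    model = model.eval().to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    for _ in range(warmup):
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0  # ms per forward pass
```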

BibTeX

@article{patel2025defm,
  title={DeFM: Learning Foundation Representations from Depth for Robotics},
  author={Patel, Manthan and Frey, Jonas and Mittal, Mayank and Yang, Fan and Hansson, Alexander and Bar, Amir and Cadena, Cesar and Hutter, Marco},
  journal={Under Review},
  year={2025},
  url={https://leggedrobotics.github.io/defm}
}