CoRL 2026 · Dataset & Benchmark Track

Do Embodied AI Models Have Physical Intuition?
An Egocentric Probing Benchmark

An offline representation benchmark and reproducible visuo-tactile data engine that probes whether Vision-Language-Action models develop human-like physical intuition.

Anonymous Authors¹
¹ Anonymous Institution  ·  Under review
🥚

Physical Boundary Probing

Boundary samples with deliberate failure modes expose safety-limit prediction in VLA models

🥤

Visual-Physical Mismatch

Empty milk carton deception reveals whether models build true visual force priors or just mimic

🌊

Visual Feedforward Prior

Water-level experiments test proportional force scaling as evidence of physical intuition

TL;DR

We introduce FEEL, an offline benchmark that tests whether VLA models develop human-like physical intuition. Using controlled water-level prior, boundary, and visual-mismatch experiments, we show that generative policies (Flow Matching) outperform discrete token-based models on continuous force dynamics, and that tactile sensing is indispensable when vision alone is deceived.

Overview

When humans interact with familiar objects, they rely on visual cues to anticipate physical properties—such as weight, stiffness, and deformability—and adjust their force strategies before contact. However, while current Vision-Language-Action (VLA) models excel at semantic understanding and action generation, it remains an open question whether their representations can transfer to predicting continuous contact forces.

To investigate this, we introduce FEEL (Force Intuition from Egocentric Experience Learning), a controlled egocentric force-intuition probing benchmark. FEEL pairs first-person RGB video and precise 6-DoF hand poses with synchronized tactile signals. Crucially, it introduces boundary samples and visual-physical mismatch samples to probe force estimation when visual priors are unreliable. In total, the benchmark comprises 3.6 hours of synchronized vision-touch-pose data covering 7 object categories within the Fragile, Deformable, and Variable-Load taxonomy.

We systematically evaluate seven representative VLA baselines (ACT, WALL-X, Pi0, Pi0.5, Pi0-FAST, X-VLA, SmolVLA) and our own model family. Massive embodied backbones often yield low or negative R² on deformable objects (e.g., X-VLA R²=0.108 overall, Pi0-FAST R²=−4.574), performing worse than a lightweight ACT baseline. We show that combining three physical inductive biases—finger-distance-conditioned state encoding, GRU temporal modeling, and an uncertainty-aware NLL head with auxiliary contact loss—and routing them through an Adaptive Mixture-of-Experts head (EgoForce) yields the strongest model overall: R²=0.703 and MAE=4.550 N at ~37M parameters, outperforming every billion-parameter foundation model on every material category.

The Physical Intuition Gap

Human manipulation relies on internal sensorimotor models that generate visual feedforward priors before contact. Current VLA models lack this capacity.

🧠

Human Internal Model

Feedforward + Feedback

Cognitive science shows that humans employ visual feedforward priors before contact. Seeing a full cup triggers anticipatory scaling of grasp force. A deceptive empty milk carton causes measurable force overshoot—proof of visual prior formation.

Feedforward: Initial force ∝ visual water level
Feedback: Tactile correction after contact transient
🤖

Current VLA Models

Vision-Only · No Physical Prior

Models like RT-2 and OpenVLA excel at semantic understanding and instruction following, yet they lack explicit physical grounding. They may correctly pick up a potato chip, but without force intuition, the result is often catastrophic failure at safety boundaries.

Problem 1: No force scaling from visual cues
Problem 2: Brittle under visual-physical mismatch
Core Scientific Question

"Can current VLA models learn this human-like physical intuition, and under what circumstances do they struggle?"

The FEEL Benchmark & Data Engine

A reproducible visuo-tactile collection setup with hardware blueprints and spatio-temporal synchronization algorithms.

Data Collection Pipeline

👓

Meta Aria Glasses

6-DoF wrist & finger poses via Aria Machine Perception Services (MPS) · millimeter-level accuracy

Paxini PX6AX Tactile Sensor

PX6AX-GEN3-DP-S2015-Elite on thumb & index · high-res fingertip force / deformation

⏱️

Spatio-Temporal Sync

Aligns high-freq visual/IMU streams with tactile readings

📐

3D-Print Blueprint

Open-source STL + BOM for low-cost sensor mount

📦

FEEL Dataset

F/D/V taxonomy · boundary & mismatch samples
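The synchronization step in the pipeline above has to put tactile readings and video frames on a common timeline. The page does not describe the algorithm, so the following is only a minimal sketch of one common approach, linear interpolation of tactile force at each frame timestamp. The function name `interpolate_tactile`, the shared clock in seconds, and the clamping at the ends are all assumptions made for illustration.

```python
from bisect import bisect_left


def interpolate_tactile(frame_ts, tactile_ts, tactile_force):
    """Linearly interpolate tactile force at each video frame timestamp.

    Assumes all timestamps are seconds on one shared clock and that
    tactile_ts is sorted in ascending order (both are assumptions of
    this sketch, not details taken from the FEEL pipeline).
    """
    aligned = []
    for t in frame_ts:
        i = bisect_left(tactile_ts, t)
        if i == 0:
            # Frame precedes (or coincides with) the first reading: clamp.
            aligned.append(tactile_force[0])
        elif i == len(tactile_ts):
            # Frame follows the last reading: clamp.
            aligned.append(tactile_force[-1])
        else:
            t0, t1 = tactile_ts[i - 1], tactile_ts[i]
            f0, f1 = tactile_force[i - 1], tactile_force[i]
            w = (t - t0) / (t1 - t0)
            aligned.append(f0 + w * (f1 - f0))
    return aligned


# Hypothetical example: 10 Hz tactile stream resampled at four frame times.
frames = [0.0, 0.05, 0.1, 0.3]
aligned_force = interpolate_tactile(frames, [0.0, 0.1, 0.2], [0.0, 2.0, 4.0])
```

A real pipeline would also need to estimate the clock offset between devices before interpolating; that step is outside this sketch.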

F/D/V Physical Taxonomy

F

Fragile

Brittle failure modes requiring precise force thresholding to avoid catastrophic damage.

Potato chips · Crackers
D

Deformable

Elastic/plastic behaviors requiring visual deformation-to-force mapping.

Sponge · Bread · Bottles
V

Variable-Load

Objects whose effective weight varies with content or fill level, requiring force estimation under a changing, visually-inferred load.

Water cups · Rice bags · Milk cartons

Three Controlled Probe Types

⚠️

Boundary Samples

Demonstrators deliberately approach or exceed unsafe force limits (e.g., crushing a potato chip). Tests safety-boundary prediction.

🌊

Water-Level Prior Samples

Cups filled to 0%, 25%, 50%, 75%, and 100% of capacity. Tests whether initial force scales with visual perception of weight.

🎭

Visual-Physical Mismatch

Visual appearance conflicts with true mass (empty can). Tests force overshoot and human-like "size-weight illusion" reproduction.
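The water-level probe above asks whether a model's initial force grows with the visible fill level. One simple way to quantify that is a least-squares fit of pre-contact force against fill fraction, where a clearly positive slope would be consistent with proportional scaling. This is a sketch only: `linear_fit` is a hypothetical helper, and the force values are made-up placeholders, not FEEL measurements.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The five fill fractions used by the probe.
fill_levels = [0.0, 0.25, 0.5, 0.75, 1.0]
# Hypothetical initial force predictions in newtons (placeholders only).
initial_force = [1.0, 1.5, 2.0, 2.5, 3.0]

slope, intercept = linear_fit(fill_levels, initial_force)
```

With these placeholder values the fit gives a slope of 2 N per unit fill fraction; a model that ignores the water level would produce a slope near zero.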

Egocentric Footage

First-person RGB clips from three experiment categories, captured with Meta Aria glasses and synchronized tactile sensors.

Technical Contributions for Force Learning

Three targeted contributions that together improve continuous contact force prediction in egocentric VLA settings.

1

NLL Uncertainty-Aware Prediction Head

Standard deterministic regression treats all force errors equally, failing to account for label noise and the inherent variability of boundary samples. We replace the regression head with an NLL formulation, making the model simultaneously predict force mean and uncertainty. This explicitly dampens the impact of noisy near-failure labels and produces calibrated predictions.

$$\mathcal{L}_{\mathrm{NLL}} = \frac{(F - \mu)^2}{2\sigma^2} + \log \sigma$$

High σ on boundary samples means the model correctly signals "I am uncertain here"—actionable for downstream safety planning. EgoForce w/o MoE (NLL+GRU+Contact) reaches Sponge R²=0.656 and the lowest 0–5 N MAE (1.66 N) overall.
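To make the dampening effect concrete, here is a minimal sketch of the per-sample loss written exactly as in the formula above. The function name `gaussian_nll` and the numeric inputs are illustrative assumptions; the actual head, its parameterization of σ, and the auxiliary contact loss are not reproduced here.

```python
import math


def gaussian_nll(force, mu, sigma):
    """Per-sample loss (F - mu)^2 / (2 * sigma^2) + log(sigma)."""
    return (force - mu) ** 2 / (2.0 * sigma ** 2) + math.log(sigma)


# Same 4 N error, two different predicted uncertainties (illustrative values).
loss_confident = gaussian_nll(4.0, 0.0, 1.0)   # small sigma: error is penalised heavily
loss_uncertain = gaussian_nll(4.0, 0.0, 4.0)   # large sigma: error is down-weighted

# Near-zero error: claiming high uncertainty now costs more via log(sigma).
loss_small_err_confident = gaussian_nll(0.1, 0.0, 1.0)
loss_small_err_uncertain = gaussian_nll(0.1, 0.0, 4.0)
```

For the 4 N error the loss drops from 8.0 to roughly 1.89 when σ rises from 1 to 4, which is how a noisy near-failure label loses influence. The log σ term keeps the model from inflating σ everywhere. Implementations often predict log σ rather than σ for numerical stability, though the page does not say which is used.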

2

Finger-Distance-Conditioned Contact State Encoding

RGB images alone cannot distinguish approach, first contact, or firm grasp. We inject the thumb–index distance as an additional input feature whose trajectory continuously encodes contact phase—information invisible to the camera.

$$d_{ti} = \lVert p_{\mathrm{thumb}} - p_{\mathrm{index}} \rVert_2 \;;\quad z_{\mathrm{ACS}} = [\, s_{\text{6-DoF}} \oplus d_{ti} \,]$$

Ablation: removing this input raises Jelly MAE from 5.115 N to 5.854 N (+14%) and drops R² by 0.172, indicating that the finger distance supplies contact-phase context that pure visual features cannot recover.
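The state encoding above amounts to a Euclidean distance followed by a concatenation. The sketch below shows that computation on raw values. The helper names `finger_distance` and `acs_state`, the metre units, and the flat-list layout of the 6-DoF pose are assumptions for illustration; any learned encoder applied afterwards is omitted.

```python
import math


def finger_distance(p_thumb, p_index):
    """Euclidean (L2) distance between thumb and index fingertip positions."""
    return math.dist(p_thumb, p_index)


def acs_state(pose_6dof, p_thumb, p_index):
    """Concatenate the 6-DoF pose with the thumb-index distance."""
    return list(pose_6dof) + [finger_distance(p_thumb, p_index)]


# Hypothetical fingertip positions in metres and a placeholder 6-DoF pose.
state = acs_state(
    [0.10, 0.20, 0.30, 0.0, 0.0, 0.0],
    p_thumb=(0.0, 0.0, 0.0),
    p_index=(0.03, 0.04, 0.0),
)
```

Here the resulting state has seven entries, the last being a 5 cm fingertip separation. Over a grasp, that last entry shrinks during approach and plateaus at contact, which is the phase signal the camera cannot provide.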

3

AdaMoE: Cross-Scenario Expert Routing

Different material classes (Fragile, Deformable, Variable-Load) exhibit heterogeneous force patterns that a single model struggles to capture. AdaMoE uses a scene-adaptive router to dynamically combine a shared prediction head with multiple expert heads (top-k selection). A Scale Adapter decouples expert selection from contribution weighting, avoiding the load-balance–accuracy trade-off of vanilla MoE.

$$\hat{y} = w_{\mathrm{shared}} \cdot h_{\mathrm{shared}} + \sum_{k} w_k \cdot h_{\text{expert-}k}$$

Router input: object one-hot + GRU hidden state. Result: AdaMoE improves Sponge R² but shows cross-material trade-offs on Jelly, revealing that material heterogeneity requires adaptive modeling.
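The combination rule above can be sketched for scalar head outputs as follows. This is one possible reading, not the authors' implementation: it assumes that router logits decide which k experts are selected, while a separate set of scale logits (standing in for the Scale Adapter) decides how much each selected expert contributes. The function `adamoe_combine`, its arguments, and the softmax normalisation are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def adamoe_combine(h_shared, h_experts, router_logits, scale_logits,
                   w_shared=1.0, k=2):
    """Return y = w_shared * h_shared + sum of weighted top-k expert outputs."""
    # Selection: indices of the k experts with the highest router logits.
    top = sorted(range(len(h_experts)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Weighting: separate logits, normalised over the selected experts only.
    weights = softmax([scale_logits[i] for i in top])
    y = w_shared * h_shared + sum(w * h_experts[i] for w, i in zip(weights, top))
    return y, top, weights


# Illustrative values: three experts, the router prefers experts 1 and 2.
y_hat, chosen, expert_weights = adamoe_combine(
    h_shared=2.0,
    h_experts=[1.0, 3.0, 10.0],
    router_logits=[0.1, 2.0, 1.5],
    scale_logits=[0.0, 0.0, 0.0],
)
```

Keeping selection and weighting on separate logits means a load-balancing term can act on the router without directly distorting the contribution weights, which is the decoupling the text attributes to the Scale Adapter.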

Offline Evaluation Results

Real results on Jelly & Sponge materials (3 runs ±σ). Strong VLA backbones do not automatically outperform targeted force models on contact force estimation.

Method | MAE (N) ↓ | RMSE (N) ↓ | R² ↑ | Pearson r ↑ | Contact F1 ↑ | Speed
Ours (NLL + Temporal)
EgoForce ⭐ | 4.550 | 8.272 | 0.703 | 0.858 | 0.903 | ~1.67 s/step
Baselines
ACT | 5.318 | 9.960 | 0.569 | 0.830 | 0.939 | ~1.2 s/step
WALL-X | 5.420 | 9.455 | 0.611 | 0.803 | 0.867 | ~1.22 s/step
Large VLA Models
SmolVLA | 7.627 | 12.440 | 0.327 | 0.715 | 0.862 | ~2.4 s/step
Pi0 | 6.243 | 11.069 | 0.468 | 0.707 | 0.865 | ~4.3 s/step
Pi0.5 | 6.451 | 11.245 | 0.451 | 0.698 | 0.879 | ~5.4 s/step
X-VLA | 8.707 | 14.339 | 0.108 | 0.585 | 0.857 | ~2.1 s/step
Pi0-FAST | 16.088 | 32.543 | −4.574 | 0.015 | 0.843 | ~10.8 s/step

Strong VLA backbones fail on continuous force estimation

Pi0-FAST collapses to overall R²=−4.574 and X-VLA to R²=0.108—both billion-parameter VLAs perform worse than the lightweight ACT baseline (R²=0.569) on continuous force curves. FEEL's controlled protocol isolates this failure mode that standard action benchmarks completely miss. Our EgoForce achieves overall R²=0.703.
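A negative R² can look surprising, so a short worked example may help. R² compares a model's squared error with that of always predicting the mean force: it is 0 for the mean predictor and drops below 0 when the model does worse than that. The sketch below uses the standard definitions of MAE, RMSE, and R²; the force values are invented for illustration and are not FEEL data.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


force = [0.0, 2.0, 4.0, 6.0, 8.0]          # illustrative ground-truth forces (N)
good = [0.5, 2.0, 4.0, 6.0, 7.5]           # tracks the curve closely
mean_only = [4.0] * 5                      # always predicts the mean
reversed_curve = [8.0, 6.0, 4.0, 2.0, 0.0] # systematically wrong trend

scores = {name: r2(force, pred) for name, pred in
          [("good", good), ("mean_only", mean_only), ("reversed", reversed_curve)]}
```

On these toy curves the mean predictor scores exactly 0 and the reversed curve scores −3, so a strongly negative value indicates predictions that are worse than a constant guess.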

✅ Normal Case — Full Can

Visual appearance matches actual weight. No force overshoot. Model correctly scales force.

[Plot: force over time. Correct prior, no overshoot.]
⚡ Mismatch — Empty Can (Deceptive)

Model applies heavy-can prior. Dramatic initial overshoot followed by correction—mirroring human "size-weight illusion."

[Plot: force over time, with an initial overshoot driven by the visual prior. Human-like misjudgment.]

Models reproduce human-like force overshoot under visual deception

The force overshoot in visuo-tactile models mirrors the human "size-weight illusion," confirming that FEEL models build genuine visual force priors—not just kinematic imitation. This misjudgment is a feature, not a bug: it proves that visual cues influence force prediction.

Visual-mismatch overshoot proves prior formation, not kinematic imitation

A model that overshoots on the empty-can probe has built a genuine visual force prior—the same "size-weight illusion" humans experience. Models that don't overshoot are simply not forming physical priors, regardless of their semantic task performance. FEEL uses this as a diagnostic signal.
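One way to turn this diagnostic into a number is to compare the peak force around contact onset with the settled force afterwards. The page does not define such a metric, so the following is a hypothetical sketch: `overshoot_ratio`, the window indices, and both force curves are assumptions chosen for illustration.

```python
def overshoot_ratio(force, onset_window, steady_window):
    """Peak force in the onset window divided by mean force in the steady window.

    Windows are (start, end) index pairs with an exclusive end.
    """
    peak = max(force[onset_window[0]:onset_window[1]])
    steady = force[steady_window[0]:steady_window[1]]
    return peak / (sum(steady) / len(steady))


# Illustrative predicted force curves in newtons (not FEEL measurements).
matched = [0.0, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0]      # prior agrees with true mass
mismatched = [0.0, 2.0, 5.0, 6.0, 3.0, 3.0, 3.0, 3.0]   # heavy prior on a light object

ratio_matched = overshoot_ratio(matched, (0, 4), (4, 8))
ratio_mismatched = overshoot_ratio(mismatched, (0, 4), (4, 8))
```

A ratio near 1 indicates no overshoot, while a ratio well above 1 on the mismatch probe (2.0 for the toy curve here) is the signature the text describes.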

Two paper-aligned analyses unpack what each inductive bias contributes (single seed = 1000). Top: ablation across our three GRU-based prediction heads (paper Table 2). Bottom: per-model MAE stratified by ground-truth force magnitude (paper Table 3).

GRU Head Ablation (Paper Table 2)

Variant | Overall R² ↑ | Overall MAE ↓ | Sponge R² ↑ | Empty B. R² ↑ | Full B. R² ↑ | 0–5 N MAE ↓
EgoForce w/o MoE & NLL & Contact (L1 head) | 0.677 | 4.916 | 0.123 | 0.641 | 0.388 | 3.47
EgoForce w/o MoE & Contact | 0.630 | 4.986 | 0.623 | 0.623 | 0.277 | 2.84
EgoForce w/o MoE (NLL+GRU+Contact) | 0.629 | 4.831 | 0.656 | 0.368 | 0.284 | 1.66

All three variants share the ResNet-18 tactile encoder, ACS state encoder, and GRU temporal backbone—differing only in the prediction head. Plain L1 wins overall R² and the high-force Empty / Full Bottle categories; NLL+Contact wins the low-force regimes (Sponge R²=0.656, 0–5 N MAE=1.66 N) thanks to heteroscedastic uncertainty + auxiliary contact loss. No single head wins everywhere — motivating the material-aware AdaMoE router (overall R²=0.703).

Force-Range-Stratified MAE (Paper Table 3)

Model | 0–5 N MAE ↓ | 5–20 N MAE ↓ | 20–50 N MAE ↓
EgoForce w/o MoE | 1.66 | 3.81 | 13.42
EgoForce ⭐ | 2.66 | 3.11 | 10.95
EgoForce w/o MoE & Contact | 2.84 | 3.62 | 12.77
ACT | 3.22 | 3.86 | 13.73
EgoForce w/o MoE & NLL & Contact | 3.47 | 3.47 | 11.05
WALL-X | 4.44 | 3.59 | 11.69
Pi0 | 4.38 | 4.09 | 13.08
Pi0.5 | 4.75 | 4.31 | 13.34
SmolVLA | 7.59 | 5.37 | 14.49
X-VLA | 9.42 | 5.80 | 13.33
Pi0-FAST | 12.76 | 13.29 | 22.19

Every billion-parameter VLA degrades faster in the low-force regime than in the mid- and high-force regimes. SmolVLA and X-VLA incur 7.6–9.4 N MAE on 0–5 N samples, an error larger than the targets themselves, while their error on the much larger 20–50 N targets stays at 13–15 N. This suggests that they bias toward the population-mean force rather than tracking contact-onset transients.
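The stratified analysis can be sketched as a simple binning of absolute errors by ground-truth force. The bin edges below follow the table's three ranges, but treating them as half-open intervals is an assumption, as are the function name `stratified_mae` and the toy data. The example deliberately uses a constant mean predictor to show the error pattern the paragraph above attributes to mean-biased models.

```python
def stratified_mae(y_true, y_pred, bins=((0.0, 5.0), (5.0, 20.0), (20.0, 50.0))):
    """MAE per ground-truth force bin; each bin is the half-open range [low, high)."""
    result = {}
    for low, high in bins:
        errors = [abs(t - p) for t, p in zip(y_true, y_pred) if low <= t < high]
        result[(low, high)] = sum(errors) / len(errors) if errors else None
    return result


# Illustrative ground-truth forces in newtons (not FEEL data); their mean is 16.
truth = [2.0, 4.0, 10.0, 14.0, 30.0, 36.0]
mean_predictor = [16.0] * len(truth)

per_bin = stratified_mae(truth, mean_predictor)
```

For this toy mean predictor the 0–5 N bin shows 13 N of error against targets of only 2–4 N, while the 5–20 N bin shows 4 N, so low-force error that dwarfs the target is what mean-seeking behaviour looks like under this stratification.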

Different inductive biases win different force regimes

Within the GRU family, NLL+Contact decisively wins the low-force regime (0–5 N MAE=1.66 N) where heteroscedastic uncertainty pays off, while plain L1 takes over in the high-force Empty / Full Bottle categories where mean prediction matters more. AdaMoE routes per-step to a shared head plus top-2 specialised experts, lifting overall R² from 0.629 to 0.703 and winning Jelly, Sponge, and Full Bottle simultaneously.

For each material, all 9 models predict on the same source clip with the predicted force curve overlaid against the ground truth.


Order (row-major): EgoForce (ours), WALL-X, ACT, Pi0, Pi0.5, SmolVLA, X-VLA, GR00T, OpenTouch.

Core Insight from Real Data
R²=0.703
EgoForce (Ours) — best overall R²
R² < 0
Pi0-FAST — large VLA fails on force
R²=0.569
ACT — lightweight baseline beats massive VLAs

What FEEL Reveals

Four empirically grounded findings that distinguish FEEL from conventional action benchmarks.

1

Strong VLA backbones ≠ better force estimation

Massive VLA backbones produce low or negative overall R² on continuous force (X-VLA R²=0.108, SmolVLA 0.327, Pi0-FAST −4.574), often worse than the lightweight ACT baseline (R²=0.569). Their action-generation strengths do not transfer to continuous contact force prediction. FEEL's controlled protocol exposes this failure mode that standard task-success metrics completely hide.

2

Both visual input and hand state are necessary signals

Within our GRU family, removing the proprioceptive state cuts Jelly R² roughly in half, with the thumb–index finger distance $d_{ti}$ as the dominant recoverable component (paper §5.2). Egocentric vision provides material/deformation cues; finger distance plus pose provides contact geometry. Models that ignore either modality cannot predict realistic force curves.

3

Temporal modeling + uncertainty awareness outperforms bigger backbones

EgoForce (~37M params) outperforms every billion-parameter foundation baseline (Pi0 578M, WALL-X 4.2B, Pi0-FAST 3.3B). The bottleneck for contact-force learning is not model scale but the right inductive biases: explicit temporal context (GRU), calibrated uncertainty (NLL head with auxiliary contact loss), and material-aware Mixture-of-Experts routing.

4

Material heterogeneity requires adaptive modeling; single models trade off

EgoForce's AdaMoE router lifts overall R² from 0.629 (no-MoE baseline) to 0.703 and is the only configuration simultaneously best on Jelly, Sponge, and Full Bottle. Routing analysis reveals partial expert specialisation along low-force (Sponge / Empty Bottle) versus high-force (Full Bottle) regimes, with Jelly shared—evidence that FEEL's F/D/V taxonomy captures physically meaningful variation.

Contributions

Open-Source Hardware Blueprint

STL files for the FEEL data collection rig. Print, assemble, and replicate the sensor mount at low cost.

BibTeX

@inproceedings{feel2026,
  title     = {Do Embodied AI Models Have Physical Intuition? An Egocentric Probing Benchmark},
  author    = {Anonymous Authors},
  booktitle = {Proceedings of the Conference on Robot Learning},
  year      = {2026},
  note      = {Under Review}
}