An offline representation benchmark and reproducible visuo-tactile data engine that probes whether Vision-Language-Action models develop human-like physical intuition.
Human manipulation relies on internal sensorimotor models that generate visual feedforward priors before contact. Current VLA models lack this capacity.
Cognitive science shows this directly: seeing a full cup triggers anticipatory scaling of grasp force, and a deceptively empty milk carton causes measurable force overshoot, which is evidence that a visual prior was formed.
Models like RT-2 and OpenVLA excel at semantic understanding and instruction following, yet they lack explicit physical grounding. They may correctly locate and reach for a potato chip, but without force intuition the grasp often ends in catastrophic failure at the safety boundary, such as crushing the chip.
"Can current VLA models learn this human-like physical intuition, and under what circumstances do they struggle?"
A reproducible visuo-tactile collection setup with hardware blueprints and spatio-temporal synchronization algorithms.
6-DoF wrist & finger poses via MPS · mm accuracy
PX6AX-GEN3-DP-S2015-Elite on thumb & index · high-res fingertip force / deformation
Aligns high-freq visual/IMU streams with tactile readings
Open-source STL + BOM for low-cost sensor mount
F/D/V taxonomy · boundary & mismatch samples
Brittle failure modes requiring precise force thresholding to avoid catastrophic damage.
Elastic/plastic behaviors requiring visual deformation-to-force mapping.
Objects whose effective weight varies with content or fill level, requiring force estimation under a changing, visually-inferred load.
Demonstrators deliberately approach or exceed unsafe force limits (e.g., crushing a potato chip). Tests safety-boundary prediction.
Cups filled at 0%, 25%, 50%, 75%, 100% capacity. Tests whether initial force scales with visual perception of weight.
Visual appearance conflicts with true mass (empty can). Tests force overshoot and human-like "size-weight illusion" reproduction.
First-person RGB clips from three experiment categories, captured with Meta Aria glasses and synchronized tactile sensors.
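As a rough illustration of what aligning camera frames with tactile readings can involve, the sketch below matches each frame to the nearest tactile sample in time. Nearest-timestamp matching, the shared clock, the `max_gap_s` tolerance, and all names here are assumptions for illustration; the actual synchronization algorithm of the collection setup is not described on this page.

```python
from bisect import bisect_left

def align_nearest(frame_ts, tactile_ts, tactile_vals, max_gap_s=0.02):
    """For each frame timestamp, pick the tactile sample nearest in time.

    Assumes both timestamp lists are sorted and expressed in seconds on a
    shared clock. Returns one tactile value per frame, or None when the
    nearest sample is farther away than max_gap_s.
    """
    aligned = []
    for t in frame_ts:
        i = bisect_left(tactile_ts, t)
        # Candidates: the sample just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(tactile_ts)]
        if not candidates:
            aligned.append(None)
            continue
        best = min(candidates, key=lambda j: abs(tactile_ts[j] - t))
        gap = abs(tactile_ts[best] - t)
        aligned.append(tactile_vals[best] if gap <= max_gap_s else None)
    return aligned

# Hypothetical streams: ~30 Hz frames, 100 Hz tactile readings (newtons).
frames = [0.000, 0.033, 0.066, 0.500]
touch_t = [0.000, 0.010, 0.020, 0.030, 0.040, 0.050, 0.060, 0.070]
touch_f = [0.0, 0.1, 0.3, 0.8, 1.5, 2.1, 2.4, 2.5]
print(align_nearest(frames, touch_t, touch_f))
```

The last frame has no tactile sample within the tolerance, so it is marked `None` rather than silently paired with stale data.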
Three targeted contributions that together improve continuous contact force prediction in egocentric VLA settings.
Standard deterministic regression treats all force errors equally, failing to account for label noise and the inherent variability of boundary samples. We replace the regression head with an NLL formulation, making the model simultaneously predict force mean and uncertainty. This explicitly dampens the impact of noisy near-failure labels and produces calibrated predictions.
High σ on boundary samples means the model correctly signals "I am uncertain here"—actionable for downstream safety planning. EgoForce w/o MoE (NLL+GRU+Contact) reaches Sponge R²=0.656 and the lowest 0–5 N MAE (1.66 N) overall.
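To make the dampening effect concrete, here is a minimal sketch of a Gaussian negative log-likelihood for a single force sample. It uses the standard heteroscedastic form with a predicted mean and log-variance; the exact parameterization used in EgoForce is not stated on this page, so the function name and signature are assumptions.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (log_var + (y - mu) ** 2 / var + math.log(2 * math.pi))

# Same 4 N error, two different predicted uncertainties.
confident = gaussian_nll(y=10.0, mu=6.0, log_var=math.log(1.0 ** 2))  # sigma = 1 N
uncertain = gaussian_nll(y=10.0, mu=6.0, log_var=math.log(4.0 ** 2))  # sigma = 4 N
print(confident, uncertain)

# With zero error, the confident prediction is the cheaper one.
exact_confident = gaussian_nll(y=6.0, mu=6.0, log_var=math.log(1.0 ** 2))
exact_uncertain = gaussian_nll(y=6.0, mu=6.0, log_var=math.log(4.0 ** 2))
print(exact_confident, exact_uncertain)
```

A large error costs less when the model also reports a large σ, which is how noisy near-failure labels get down-weighted. The log-variance term penalizes claiming high uncertainty everywhere, so the model cannot lower its loss simply by always being unsure.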
RGB images alone cannot distinguish approach, first contact, or firm grasp. We inject the thumb–index distance as an additional input feature whose trajectory continuously encodes contact phase—information invisible to the camera.
Ablation: removing this input raises Jelly MAE from 5.115 to 5.854 (+14%) and drops R² by 0.172, indicating that it carries contact-phase context that purely visual features cannot recover.
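A minimal sketch of how such a feature could be computed and appended to per-frame inputs is shown below. The fingertip coordinates, units, and helper names are hypothetical; the page states only that the thumb–index distance is injected as an additional input.

```python
import math

def thumb_index_distance(thumb_xyz, index_xyz):
    """Euclidean distance between two fingertip positions (same units as input)."""
    return math.dist(thumb_xyz, index_xyz)

def add_contact_feature(visual_feat, thumb_xyz, index_xyz):
    """Append the fingertip distance to a per-frame feature vector."""
    return list(visual_feat) + [thumb_index_distance(thumb_xyz, index_xyz)]

# Hypothetical fingertip tracks (metres) while the hand closes on an object.
thumb = [(0.00, 0.00, 0.00)] * 3
index = [(0.08, 0.00, 0.00), (0.05, 0.00, 0.00), (0.03, 0.00, 0.00)]
distances = [thumb_index_distance(t, i) for t, i in zip(thumb, index)]
print(distances)
```

A shrinking distance that then plateaus is the kind of trajectory that separates approach from stable grasp, even when the RGB frames look nearly identical.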
Different materials (Fragile, Deformable, Soft) exhibit heterogeneous force patterns that a single model struggles to capture. AdaMoE uses a scene-adaptive router to dynamically combine a shared prediction head with multiple expert heads (top-k selection). A Scale Adapter decouples expert selection from contribution weighting, avoiding the load-balance–accuracy trade-off of vanilla MoE.
Router input: object one-hot + GRU hidden state. Result: AdaMoE improves Sponge R² but shows cross-material trade-offs on Jelly, revealing that material heterogeneity requires adaptive modeling.
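The following sketch shows one possible reading of "decoupling expert selection from contribution weighting": router logits choose which experts fire, while a separate set of scale logits decides how much each selected expert contributes. The additive combination with the shared head, the softmax over scale logits, and every name below are assumptions, not the published AdaMoE implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_force(shared_out, expert_outs, router_logits, scale_logits, k=2):
    """Combine a shared head with the top-k experts.

    router_logits decide WHICH experts are selected; scale_logits decide
    HOW MUCH each selected expert contributes (assumed Scale Adapter role).
    """
    # Selection: indices of the k largest router logits.
    order = sorted(range(len(expert_outs)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Weighting: softmax over the scale logits of the selected experts only.
    weights = softmax([scale_logits[i] for i in top])
    expert_mix = sum(w * expert_outs[i] for w, i in zip(weights, top))
    return shared_out + expert_mix, top

force, chosen = moe_force(
    shared_out=5.0,
    expert_outs=[1.0, -2.0, 4.0],
    router_logits=[0.2, 2.0, 1.5],
    scale_logits=[0.0, 0.0, 0.0],
)
print(force, chosen)
```

Because the weights do not come from the router logits, a load-balancing term on the router would not directly change how strongly a chosen expert speaks, which is the trade-off the text says vanilla MoE suffers from.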
Real results on Jelly & Sponge materials (3 runs ±σ). Strong VLA backbones do not automatically outperform targeted force models on contact force estimation.
| Method | MAE (N) ↓ | RMSE (N) ↓ | R² ↑ | Pearson r ↑ | Contact F1 ↑ | Speed |
|---|---|---|---|---|---|---|
| **Ours (NLL + Temporal)** | | | | | | |
| EgoForce ⭐ | 4.550 | 8.272 | 0.703 | 0.858 | 0.903 | ~1.67 s/step |
| **Baselines** | | | | | | |
| ACT | 5.318 | 9.960 | 0.569 | 0.830 | 0.939 | ~1.2 s/step |
| WALL-X | 5.420 | 9.455 | 0.611 | 0.803 | 0.867 | ~1.22 s/step |
| **Large VLA Models** | | | | | | |
| SmolVLA | 7.627 | 12.440 | 0.327 | 0.715 | 0.862 | ~2.4 s/step |
| Pi0 | 6.243 | 11.069 | 0.468 | 0.707 | 0.865 | ~4.3 s/step |
| Pi0.5 | 6.451 | 11.245 | 0.451 | 0.698 | 0.879 | ~5.4 s/step |
| X-VLA | 8.707 | 14.339 | 0.108 | 0.585 | 0.857 | ~2.1 s/step |
| Pi0-FAST | 16.088 | 32.543 | −4.574 | 0.015 | 0.843 | ~10.8 s/step |
Pi0-FAST collapses to overall R²=−4.574 and X-VLA to R²=0.108—both billion-parameter VLAs perform worse than the lightweight ACT baseline (R²=0.569) on continuous force curves. FEEL's controlled protocol isolates this failure mode that standard action benchmarks completely miss. Our EgoForce achieves overall R²=0.703.
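For readers unfamiliar with negative R², the sketch below computes MAE, RMSE, R², and Pearson r using their standard definitions on a toy force curve. Whether the benchmark computes them in exactly this way (for example per clip or pooled) is an assumption, and the data are made up.

```python
import math

def regression_metrics(y_true, y_pred):
    """Standard MAE, RMSE, R^2, and Pearson r for two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / n
    ss_res = sum(e * e for e in errs)
    rmse = math.sqrt(ss_res / n)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # negative when worse than predicting the mean
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(ss_tot * var_p) if var_p > 0 else 0.0
    return {"mae": mae, "rmse": rmse, "r2": r2, "pearson": pearson}

truth = [0.0, 2.0, 4.0, 6.0, 8.0]
biased = [t + 10.0 for t in truth]  # right shape, constant 10 N offset
metrics = regression_metrics(truth, biased)
print(metrics)
```

R² drops below zero whenever the squared error exceeds that of a constant mean predictor, so a strongly negative value means the predictions are worse than simply outputting the mean force. The toy case also shows that Pearson r can stay high while R² is negative, which is why the table reports both.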
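Matched condition: visual appearance matches actual weight. There is no force overshoot, and the model scales its initial force correctly.
Mismatch condition (empty can): the model applies a heavy-can prior, producing a dramatic initial overshoot followed by correction, which mirrors the human "size-weight illusion."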
The force overshoot in visuo-tactile models mirrors the human "size-weight illusion," confirming that FEEL models build genuine visual force priors—not just kinematic imitation. This misjudgment is a feature, not a bug: it proves that visual cues influence force prediction.
A model that overshoots on the empty-can probe has built a genuine visual force prior—the same "size-weight illusion" humans experience. Models that don't overshoot are simply not forming physical priors, regardless of their semantic task performance. FEEL uses this as a diagnostic signal.
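One simple way to turn this diagnostic into a number is to compare the peak of a predicted force trace with its settled value. The ratio below is a hypothetical metric introduced here for illustration, not the measure used in the paper, and the traces are invented.

```python
def overshoot_ratio(force, settle_window=5):
    """Peak force divided by the mean of the last `settle_window` samples.

    Values well above 1 indicate an initial overshoot followed by correction.
    """
    if settle_window <= 0 or len(force) < settle_window:
        raise ValueError("force trace shorter than settle_window")
    steady = sum(force[-settle_window:]) / settle_window
    if steady <= 0:
        raise ValueError("steady-state force must be positive")
    return max(force) / steady

# Hypothetical predicted force traces (newtons).
matched = [0.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
mismatch = [0.0, 3.0, 5.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0]
print(overshoot_ratio(matched), overshoot_ratio(mismatch))
```

Under this reading, a model whose ratio stays near 1 on the empty-can probe would count as not forming a visual force prior, while a clearly elevated ratio would count as evidence of one.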
Two paper-aligned analyses unpack what each inductive bias contributes (single seed = 1000). Top: ablation across our three GRU-based prediction heads (paper Table 2). Bottom: per-model MAE stratified by ground-truth force magnitude (paper Table 3).
| Variant | Overall R² ↑ | Overall MAE ↓ | Sponge R² ↑ | Empty B. R² ↑ | Full B. R² ↑ | 0–5 N MAE ↓ |
|---|---|---|---|---|---|---|
| EgoForce w/o MoE & NLL & Contact (L1 head) | 0.677 | 4.916 | 0.123 | 0.641 | 0.388 | 3.47 |
| EgoForce w/o MoE & Contact | 0.630 | 4.986 | 0.623 | 0.623 | 0.277 | 2.84 |
| EgoForce w/o MoE (NLL+GRU+Contact) | 0.629 | 4.831 | 0.656 | 0.368 | 0.284 | 1.66 |
All three variants share the ResNet-18 tactile encoder, ACS state encoder, and GRU temporal backbone—differing only in the prediction head. Plain L1 wins overall R² and the high-force Empty / Full Bottle categories; NLL+Contact wins the low-force regimes (Sponge R²=0.656, 0–5 N MAE=1.66 N) thanks to heteroscedastic uncertainty + auxiliary contact loss. No single head wins everywhere — motivating the material-aware AdaMoE router (overall R²=0.703).
| Model | 0–5 N MAE ↓ | 5–20 N MAE ↓ | 20–50 N MAE ↓ |
|---|---|---|---|
| EgoForce w/o MoE | 1.66 | 3.81 | 13.42 |
| EgoForce ⭐ | 2.66 | 3.11 | 10.95 |
| EgoForce w/o MoE & Contact | 2.84 | 3.62 | 12.77 |
| ACT | 3.22 | 3.86 | 13.73 |
| EgoForce w/o MoE & NLL & Contact | 3.47 | 3.47 | 11.05 |
| WALL-X | 4.44 | 3.59 | 11.69 |
| Pi0 | 4.38 | 4.09 | 13.08 |
| Pi0.5 | 4.75 | 4.31 | 13.34 |
| SmolVLA | 7.59 | 5.37 | 14.49 |
| X-VLA | 9.42 | 5.80 | 13.33 |
| Pi0-FAST | 12.76 | 13.29 | 22.19 |
Every billion-parameter VLA degrades more in the low-force regime than at mid or high force. SmolVLA and X-VLA incur 7.6–9.4 N MAE on 0–5 N samples, an error larger than the targets themselves, while keeping 13–15 N error on the much larger 20–50 N targets. This suggests they bias toward the population-mean force rather than tracking contact-onset transients.
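The stratification can be sketched as follows, together with a constant predictor that always outputs the population mean. The half-open bin edges, the helper name, and the toy data are assumptions; only the three force ranges come from the table.

```python
def stratified_mae(y_true, y_pred, bins=((0, 5), (5, 20), (20, 50))):
    """MAE per ground-truth force bin; bins are treated as [lo, hi) in newtons."""
    out = {}
    for lo, hi in bins:
        errs = [abs(p - t) for t, p in zip(y_true, y_pred) if lo <= t < hi]
        out[(lo, hi)] = sum(errs) / len(errs) if errs else None
    return out

# Hypothetical ground-truth forces and a predictor that ignores its input.
truth = [1.0, 3.0, 10.0, 15.0, 30.0, 40.0]
mean_force = sum(truth) / len(truth)  # 16.5 N
constant = [mean_force] * len(truth)
by_bin = stratified_mae(truth, constant)
print(by_bin)
```

In this toy case the mean predictor's error in the 0–5 N bin is several times the target force itself, which is the signature the paragraph above attributes to mean-biased models.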
Within the GRU family, NLL+Contact decisively wins the low-force regime (0–5 N MAE=1.66 N) where heteroscedastic uncertainty pays off, while plain L1 takes over in the high-force Empty / Full Bottle categories where mean prediction matters more. AdaMoE routes per-step to a shared head plus top-2 specialised experts, lifting overall R² from 0.629 to 0.703 and winning Jelly, Sponge, and Full Bottle simultaneously.
For each material, all 9 models predict on the same source clip, shown in a 3×3 grid with the predicted force curve overlaid against the ground truth.
Order (row-major): EgoForce (ours), WALL-X, ACT, Pi0, Pi0.5, SmolVLA, X-VLA, GR00T, OpenTouch.
Four empirically grounded findings that distinguish FEEL from conventional action benchmarks.
Massive VLA backbones produce low or negative overall R² on continuous force (X-VLA R²=0.108, SmolVLA 0.327, Pi0-FAST −4.574), often worse than the lightweight ACT baseline (R²=0.569). Their action-generation strengths do not transfer to continuous contact force prediction. FEEL's controlled protocol exposes this failure mode that standard task-success metrics completely hide.
Within our GRU family, removing the proprioceptive state cuts Jelly R² roughly in half, with the thumb–index finger distance d_ti as the dominant recoverable component (paper §5.2). Egocentric vision provides material and deformation cues; finger distance plus pose provides contact geometry. Models that ignore either modality cannot predict realistic force curves.
EgoForce (~37M params) outperforms every much larger foundation-model baseline (Pi0 578M, WALL-X 4.2B, Pi0-FAST 3.3B). The bottleneck for contact-force learning is not model scale but the right inductive biases: explicit temporal context (GRU), calibrated uncertainty (NLL head with auxiliary contact loss), and material-aware Mixture-of-Experts routing.
EgoForce's AdaMoE router lifts overall R² from 0.629 (no-MoE baseline) to 0.703 and is the only configuration simultaneously best on Jelly, Sponge, and Full Bottle. Routing analysis reveals partial expert specialisation along low-force (Sponge / Empty Bottle) versus high-force (Full Bottle) regimes, with Jelly shared—evidence that FEEL's F/D/V taxonomy captures physically meaningful variation.
A benchmark specifically designed for testing whether VLA models develop human-like visual force priors—not a general egocentric tactile corpus. OpenTouch proved wearable touch collection is feasible; FEEL advances this with controlled counterfactual protocols (water-level, boundary, visual-mismatch), enabling cognitive-style probing of physical intuition in foundation models.
Two architectural contributions targeting the specific challenges of contact force learning: (1) NLL prediction head for uncertainty-aware regression under label noise and boundary variability; (2) AdaMoE scene-adaptive routing with Scale Adapter, which decouples expert selection from contribution weighting and reveals material heterogeneity as a first-class modeling challenge.
Thumb–index distance injected as input feature encoding contact phase (approach, onset, stable grasp)—invisible to RGB alone. Ablation: removing this input raises Jelly MAE by +14% and drops R² by 0.172.
Systematic evidence that state-of-the-art generative VLAs (Pi0, Pi0.5, X-VLA) fail on continuous contact force estimation despite excelling at semantic tasks. The bottleneck is not model scale but modeling inductive bias—temporal context, uncertainty calibration, and scene-adaptive routing. FEEL provides the controlled benchmark to measure this gap.
@inproceedings{feel2026,
  title     = {Do Embodied AI Models Have Physical Intuition? An Egocentric Probing Benchmark},
  author    = {Anonymous Authors},
  booktitle = {Proceedings of the Conference on Robot Learning},
  year      = {2026},
  note      = {Under Review}
}