NeurIPS 2026 · Dataset & Benchmark Track

FEEL: Force Intuition from
Egocentric Experience Learning

An offline representation benchmark and reproducible visuo-tactile data engine that probes whether Vision-Language-Action models develop human-like physical intuition.

Anonymous Authors¹
¹ Anonymous Institution · Under review
🥚

Physical Boundary Probing

Boundary samples with deliberate failure modes expose safety-limit prediction in VLA models

🥤

Visual-Physical Mismatch

Empty Coke can deception reveals whether models build true visual force priors or just mimic

🌊

Visual Feedforward Prior

Water-level experiments test proportional force scaling as evidence of physical intuition

TL;DR

We introduce FEEL, an offline benchmark testing whether VLA models develop human-like physical intuition. By designing controlled water-level prior, boundary, and visual-mismatch experiments, we show that generative policies (Flow Matching) outperform discrete token-based models on continuous force dynamics, and that tactile sensing is indispensable when vision alone is deceived.

Overview

While Vision-Language-Action (VLA) models have demonstrated remarkable semantic intelligence, they still lack the physical intuition inherent to human manipulation. Humans routinely form visual force priors before contact: for example, they scale the initial grasp force with the perceived water level in a cup, and they over-apply force when a visually heavy object—such as a sealed-looking but empty Coke can—violates that prior.

To investigate whether foundation models acquire similar physical priors, we introduce FEEL (Force Intuition from Egocentric Experience Learning), an offline representation benchmark and reproducible egocentric visuo-tactile data engine designed specifically for controlled force-intuition probes. Unlike broad in-the-wild tactile datasets, FEEL focuses on boundary samples and visual-physical mismatch samples that expose how models estimate force, weight, and safety limits.

To improve force learning in existing VLAs, we further introduce Aperture-Conditioned Stiffness Anchoring and a Physics-Aware Derivative Penalty loss. Through offline evaluations across 8 VLA/policy architectures, we test whether models reproduce human-like visual priors, force overshoot misjudgments, and safety-boundary predictions. Our results suggest that generative policies (e.g., Flow Matching) are better suited than discrete token-based models for continuous physical dynamics.

The Physical Intuition Gap

Human manipulation relies on internal sensorimotor models that generate visual feedforward priors before contact. Current VLA models lack this capacity.

🧠

Human Internal Model

Feedforward + Feedback

Cognitive science shows that humans employ visual feedforward priors before contact. Seeing a full cup triggers anticipatory scaling of grasp force. A deceptive empty Coke can causes measurable force overshoot—proof of visual prior formation.

Feedforward: Initial force ∝ visual water level
Feedback: Tactile correction after contact transient
🤖

Current VLA Models

Vision-Only · No Physical Prior

Models like RT-2 and OpenVLA excel at semantic understanding and instruction following, yet they lack explicit physical grounding. They may correctly locate and grasp an egg, but without force intuition the grasp often ends in catastrophic failure at the safety boundary.

Problem 1: No force scaling from visual cues
Problem 2: Brittle under visual-physical mismatch
Core Scientific Question

"Can current VLA models learn this human-like physical intuition, and under what circumstances do they struggle?"

The FEEL Benchmark & Data Engine

A reproducible visuo-tactile collection setup with hardware blueprints and spatio-temporal synchronization algorithms.

Data Collection Pipeline

👓

Meta Aria Glasses

6-DoF wrist & finger poses via MPS · mm accuracy

Tactile Sensor

High-res contact force distribution · deformation map

⏱️

Spatio-Temporal Sync

Aligns high-freq visual/IMU streams with tactile readings (see the alignment sketch after this pipeline)

📐

3D-Print Blueprint

Open-source STL + BOM for low-cost sensor mount

📦

FEEL Dataset

F/D/S taxonomy · boundary & mismatch samples
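Below is a minimal sketch of how the Spatio-Temporal Sync stage could resample a high-rate tactile stream onto camera frame timestamps via nearest-neighbour matching with a staleness bound. The function name, sample rates, and the 20 ms gap threshold are illustrative assumptions, not the released synchronization code.

import numpy as np

def align_to_reference(ref_ts, src_ts, src_vals, max_gap=0.02):
    """Resample a source stream (e.g. 200 Hz tactile) onto reference timestamps (e.g. 30 Hz camera).

    ref_ts:   (N,) reference timestamps in seconds, sorted ascending
    src_ts:   (M,) source timestamps in seconds, sorted ascending
    src_vals: (M, D) source readings
    max_gap:  reject matches whose nearest source sample is farther away than this (seconds)
    Returns the (N, D) aligned values and a boolean validity mask.
    """
    idx = np.clip(np.searchsorted(src_ts, ref_ts), 1, len(src_ts) - 1)
    left, right = idx - 1, idx
    take_left = np.abs(ref_ts - src_ts[left]) <= np.abs(src_ts[right] - ref_ts)
    nearest = np.where(take_left, left, right)            # nearest-neighbour source index
    valid = np.abs(src_ts[nearest] - ref_ts) <= max_gap   # drop stale matches
    return src_vals[nearest], valid

# Usage: align a 200 Hz tactile force stream to 30 Hz camera frame times.
cam_ts = np.arange(0.0, 5.0, 1 / 30)
tac_ts = np.arange(0.0, 5.0, 1 / 200)
tac_f = np.random.rand(len(tac_ts), 1)
aligned_f, ok = align_to_reference(cam_ts, tac_ts, tac_f)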

F/D/S Physical Taxonomy

F

Fragile

Brittle failure modes requiring precise force thresholding to avoid catastrophic damage.

Eggs · Chips · Crackers
D

Deformable

Elastic/plastic behaviors requiring visual deformation-to-force mapping.

Sponge · Bread · Bottles
S

Soft/Fluid

Dynamic mass distribution requiring weight/CoM estimation from visual cues.

Water cups · Rice bags · Coke cans

Three Controlled Probe Types

⚠️

Boundary Samples

Demonstrators deliberately approach or exceed unsafe force limits (e.g., crushing an egg). Tests safety-boundary prediction.

🌊

Water-Level Prior Samples

Cups filled at 0%, 25%, 50%, 75%, 100% capacity. Tests whether initial force scales with visual perception of weight.

🎭

Visual-Physical Mismatch

Visual appearance conflicts with true mass (empty can). Tests force overshoot and human-like "size-weight illusion" reproduction.
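To make the taxonomy and probe labels concrete, here is a minimal sketch of how a single FEEL episode record could be organized; every field name below is an assumed illustration, not the released dataset schema.

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class FeelEpisode:
    """One demonstration episode with its F/D/S category and probe labels (illustrative schema)."""
    object_name: str                       # e.g. "coke_can"
    category: str                          # "F" fragile, "D" deformable, "S" soft/fluid
    probe_type: str                        # "boundary", "water_level", "mismatch", or "normal"
    water_level: Optional[float] = None    # 0.0-1.0 fill fraction for water-level prior samples
    is_mismatch: bool = False              # visual appearance conflicts with true mass
    boundary_violation: bool = False       # demonstrator deliberately exceeded the safe force limit
    rgb: Optional[np.ndarray] = None       # (T, H, W, 3) synchronized egocentric frames
    tactile: Optional[np.ndarray] = None   # (T, P) contact-force distribution
    aperture: Optional[np.ndarray] = None  # (T,) thumb-index distance in metres
    force: Optional[np.ndarray] = None     # (T,) scalar grasp force in newtons

# Example: a 50%-filled water cup in the Soft/Fluid category of the water-level prior probe.
ep = FeelEpisode(object_name="water_cup", category="S",
                 probe_type="water_level", water_level=0.5)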

Technical Contributions for Force Learning

Two targeted improvements that significantly enhance physical dynamics prediction in existing VLA architectures.

1

Aperture-Conditioned Stiffness Anchoring

Classical mechanics relates force to deformation via Hooke's law. We explicitly couple gripper aperture (thumb–index distance) with force estimation to anchor object stiffness, providing the model a continuous physical reference that pure visual features cannot supply.

F = k · Δx    where Δx = compression inferred from the thumb–index aperture

The aperture signal acts as a stiffness anchor: the same visual object at different compression states maps to different forces, resolving ambiguity that is otherwise impossible from vision alone.
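As a rough illustration of aperture-conditioned stiffness anchoring, the sketch below predicts a Hooke-style force term k · Δx from a vision-estimated stiffness and the thumb–index aperture, plus a learned residual. It is a hypothetical PyTorch module under assumed feature shapes, not the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ApertureConditionedForceHead(nn.Module):
    """Force head anchored by the thumb-index aperture (hypothetical module).

    Vision predicts a stiffness k and a rest aperture x0; the force estimate is
    anchored by the Hooke-style term k * (x0 - aperture) plus a learned residual.
    """
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.stiffness_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),          # outputs (raw stiffness, rest aperture x0)
        )
        self.residual_net = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),          # residual correction for non-linear effects
        )

    def forward(self, visual_feat: torch.Tensor, aperture: torch.Tensor) -> torch.Tensor:
        k_raw, x0 = self.stiffness_net(visual_feat).unbind(-1)
        k = F.softplus(k_raw)                          # stiffness must stay positive
        compression = (x0 - aperture).clamp(min=0.0)   # Δx: how far the object is squeezed
        hooke_force = k * compression                  # anchored linear-elastic term
        residual = self.residual_net(
            torch.cat([visual_feat, aperture.unsqueeze(-1)], dim=-1)).squeeze(-1)
        return hooke_force + residual

# Usage with a batch of 4 fused visual features and aperture readings in metres.
head = ApertureConditionedForceHead(feat_dim=256)
force = head(torch.randn(4, 256), torch.tensor([0.05, 0.04, 0.03, 0.02]))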

2

Physics-Aware Derivative Penalty Loss

Standard MSE loss treats all force errors equally, failing to penalize dangerous overshoots that could shatter fragile objects. Our derivative-aware loss explicitly penalizes abrupt force transients and overshoots, producing safer, physically coherent predictions.

L = L_MSE + λ₁ · L_overshoot + λ₂ · ‖ΔF′‖²

L_overshoot penalizes predictions that exceed the ground-truth peak. The derivative regularizer ‖ΔF′‖² encourages smooth, physically plausible force curves.
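A minimal sketch of such a loss is shown below, assuming predicted and ground-truth force trajectories of shape (batch, T); here the derivative term is read as the mismatch between finite-difference force rates, and the exact weighting and discretization in the paper may differ.

import torch
import torch.nn.functional as F

def physics_aware_loss(pred: torch.Tensor, gt: torch.Tensor,
                       lam_overshoot: float = 1.0, lam_smooth: float = 0.1) -> torch.Tensor:
    """L = L_MSE + λ1 · L_overshoot + λ2 · ‖ΔF′‖² on (batch, T) force trajectories."""
    mse = F.mse_loss(pred, gt)

    # Overshoot: penalize only the amount by which predictions exceed the
    # ground-truth peak force (the dangerous direction for fragile objects).
    gt_peak = gt.max(dim=-1, keepdim=True).values
    l_overshoot = ((pred - gt_peak).clamp(min=0.0) ** 2).mean()

    # Derivative penalty: finite-difference force rates; penalize the squared
    # mismatch so abrupt, physically implausible transients are discouraged.
    d_pred = pred[..., 1:] - pred[..., :-1]
    d_gt = gt[..., 1:] - gt[..., :-1]
    l_smooth = ((d_pred - d_gt) ** 2).mean()

    return mse + lam_overshoot * l_overshoot + lam_smooth * l_smooth

# Usage on a batch of 8 predicted force curves with 100 timesteps each.
pred = torch.randn(8, 100, requires_grad=True)
gt = torch.randn(8, 100).abs()
loss = physics_aware_loss(pred, gt)
loss.backward()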

Offline Evaluation Results

Three controlled probes across 8 VLA architectures, tested on physical consistency, safety boundaries, and visual prior reproduction.

Initial Peak Force vs. Visual Water Level — Visuo-Tactile Model (Pi0)
[Plot: initial peak force (0–6 N) vs. visual water level (0–100%); series: Human, Visuo-Tactile (Pi0), Vision-Only (Pi0)]

Visuo-Tactile models successfully learn visual feedforward priors

Initial peak force scales proportionally with visual water level (Pearson r > 0.97 for Pi0 V+T), closely matching human demonstrators. Vision-only models show near-flat responses, confirming that tactile input is essential for force prior formation.
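As an illustration, the water-level prior probe can be scored roughly as below: extract the initial peak force per episode and correlate it with the fill fraction. The contact-onset heuristic, window size, and field names are assumptions made for this sketch.

import numpy as np

def initial_peak_force(force, contact_window=50, contact_threshold=0.1):
    """Peak force within `contact_window` samples after contact onset (assumed heuristic)."""
    onset = int(np.argmax(force > contact_threshold))   # first sample above a small contact force (N)
    return float(force[onset:onset + contact_window].max())

def water_level_prior_score(episodes):
    """Pearson r between visual water level (0-1 fill fraction) and initial peak force."""
    levels = np.array([ep["water_level"] for ep in episodes])
    peaks = np.array([initial_peak_force(ep["force"]) for ep in episodes])
    return float(np.corrcoef(levels, peaks)[0, 1])

# Toy check: synthetic force curves whose peak grows linearly with the fill fraction.
episodes = [{"water_level": w,
             "force": np.concatenate([np.zeros(10), np.linspace(0.0, 2.0 + 4.0 * w, 40)])}
            for w in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(water_level_prior_score(episodes))   # ≈ 1.0 on this synthetic data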

✅ Normal Case — Full Can

Visual appearance matches actual weight. No force overshoot. Model correctly scales force.

[Force-time sketch: Correct Prior · No Overshoot]
⚡ Mismatch — Empty Can (Deceptive)

Model applies heavy-can prior. Dramatic initial overshoot followed by correction—mirroring human "size-weight illusion."

[Force-time sketch: Overshoot from Visual Prior · Human-Like Misjudgment]

Models reproduce human-like force overshoot under visual deception

The force overshoot in visuo-tactile models mirrors the human "size-weight illusion," confirming that FEEL models build genuine visual force priors—not just kinematic imitation. This misjudgment is a feature, not a bug: it proves that visual cues influence force prediction.

Architecture                 | Modality      | Force MAE ↓ | Weight MAE ↓ | RMSE ↓ | R² ↑
CNN-Based
ACT (CNN-BC)                 | Vision-Only   | 1.85        | 42.3         | 2.10   | 0.45
                             | Visuo-Tactile | 0.92        | 18.5         | 1.15   | 0.78
VLM-Based
SmolVLA                      | Vision-Only   | 1.78        | 40.1         | 1.95   | 0.48
                             | Visuo-Tactile | 0.88        | 16.2         | 1.02   | 0.81
Flow Matching (Generative)
Pi0                          | Vision-Only   | 1.65        | 38.5         | 1.88   | 0.52
                             | Visuo-Tactile | 0.65        | 12.4         | 0.85   | 0.89
Transformer
GR00T                        | Vision-Only   | 1.70        | 39.2         | 1.90   | 0.50
                             | Visuo-Tactile | 0.72        | 14.1         | 0.92   | 0.86

Flow Matching achieves best physical dynamics modeling

Pi0 (Flow Matching) + Visuo-Tactile achieves the lowest Force MAE (0.65) and highest R² (0.89), outperforming discrete token-based models. Across all architectures, Visuo-Tactile outperforms Vision-Only by 40–60% on force estimation.
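For reference, the table's force-regression metrics can be computed as in the sketch below; how predictions are aggregated (per-step, as here, versus per-episode) is an assumption.

import numpy as np

def force_regression_metrics(pred, gt):
    """Force MAE, RMSE and R² over flattened predicted vs. ground-truth forces."""
    pred, gt = np.ravel(pred), np.ravel(gt)
    err = pred - gt
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r2 = 1.0 - (err ** 2).sum() / ((gt - gt.mean()) ** 2).sum()
    return {"Force MAE": mae, "RMSE": rmse, "R2": r2}

# Usage on toy trajectories (8 episodes x 100 steps).
gt = np.abs(np.random.randn(8, 100))
pred = gt + 0.1 * np.random.randn(8, 100)
print(force_regression_metrics(pred, gt))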

Architecture | Modality      | Early Warning (s) ↑ | Violation Rate (%) ↓ | F1 (Unsafe) ↑
ACT          | Vision-Only   | 0.12                | 78.5                 | 0.34
             | Visuo-Tactile | 0.65                | 22.1                 | 0.76
Pi0          | Vision-Only   | 0.15                | 75.2                 | 0.38
             | Visuo-Tactile | 0.88                | 12.4                 | 0.89

Tactile input enables anticipatory safety boundary detection

Vision-only models fail catastrophically on Fragile objects (violation rate > 75%). With tactile sensing, Pi0 achieves 0.88s early warning before failure—sufficient to prevent most physical damage—reducing violations by 83%.
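The safety-boundary metrics can be scored along the lines of the following sketch, where early warning is the time between the model's first unsafe flag and the ground-truth violation; the threshold, timestep, and episode fields are illustrative assumptions.

import numpy as np

def safety_boundary_metrics(episodes, unsafe_threshold=0.5, dt=0.01):
    """Early-warning time and violation rate over episodes (illustrative definitions).

    Each episode dict holds:
      "unsafe_prob":   (T,) predicted probability that the current force is unsafe
      "violation_idx": index of the first ground-truth safety-limit violation, or None
    """
    warnings, violations, n = [], 0, 0
    for ep in episodes:
        if ep["violation_idx"] is None:
            continue
        n += 1
        flagged = np.flatnonzero(ep["unsafe_prob"][:ep["violation_idx"]] > unsafe_threshold)
        if flagged.size:                    # alarm raised before the violation
            warnings.append((ep["violation_idx"] - flagged[0]) * dt)
        else:                               # no timely alarm -> counted as a violation
            violations += 1
    return {"Early Warning (s)": float(np.mean(warnings)) if warnings else 0.0,
            "Violation Rate (%)": 100.0 * violations / max(n, 1)}

# Usage: the model flags danger 30 steps (0.3 s) before the crush at step 100.
ep = {"unsafe_prob": np.concatenate([np.zeros(70), np.ones(50)]), "violation_idx": 100}
print(safety_boundary_metrics([ep]))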

Architectural Physical Intelligence Ranking

Pi0 · Flow Matching · Physical Score 89%
GR00T · Transformer · Physical Score 86%
SmolVLA · VLM · Physical Score 81%
ACT · CNN-BC · Physical Score 78%

Architectural Insights

What FEEL reveals about the physical intelligence landscape of current foundation models.

1

Generative policies are better suited for continuous physical dynamics

Flow Matching architectures (Pi0) outperform discrete token-based models (ACT) on all force-related metrics. The continuous generation process inherently preserves temporal coherence in force trajectories.

2

Tactile sensing is irreplaceable when vision is deceived

On visual-physical mismatch samples, vision-only models fail to distinguish the deception—their force predictions remain incorrect. Only models with tactile input can form and correct physical priors.

3

Force overshoot under visual mismatch is evidence of prior formation, not a failure

Models that produce force overshoot on the empty Coke can experiment are demonstrating genuine visual prior formation—the same "size-weight illusion" that humans experience. Models without this overshoot are simply not building physical priors.

4

Aperture-conditioned stiffness anchoring reduces force MAE by ~29%

Ablation studies confirm that thumb–index aperture as an auxiliary input provides a direct physical reference that resolves ambiguity in stiffness estimation, yielding substantial improvements across all object categories.

Contributions

BibTeX

@inproceedings{feel2026,
  title     = {FEEL: Force Intuition from Egocentric Experience Learning},
  author    = {Anonymous Authors},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2026},
  note      = {Under Review},
  url       = {https://github.com/306327680/FEEL-Unlocking-Force-Intuition-in-VLA-Models_Web}
}