An offline representation benchmark and reproducible visuo-tactile data engine for probing whether Vision-Language-Action (VLA) models develop human-like physical intuition.
Human manipulation relies on internal sensorimotor models that generate visual feedforward priors before contact. Current VLA models lack this capacity.
Cognitive science shows that humans form visual feedforward priors before contact: seeing a full cup triggers anticipatory scaling of grasp force, while a deceptively empty Coke can causes measurable force overshoot, direct evidence of visual prior formation.
Models like RT-2 and OpenVLA excel at semantic understanding and instruction following, yet they lack explicit physical grounding. They may plan the right grasp for an egg, but without force intuition the result is often catastrophic failure at safety boundaries.
"Can current VLA models learn this human-like physical intuition, and under what circumstances do they struggle?"
A reproducible visuo-tactile collection setup with hardware blueprints and spatio-temporal synchronization algorithms.
6-DoF wrist & finger poses via MPS · mm accuracy
High-res contact force distribution · deformation map
Aligns high-frequency visual/IMU streams with tactile readings (see the synchronization sketch below)
Open-source STL + BOM for low-cost sensor mount
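To make the synchronization step concrete, here is a minimal sketch of the core alignment operation: resampling a high-rate tactile stream onto camera frame timestamps by linear interpolation after a clock-offset correction. The function, rates, and offset handling are illustrative assumptions, not the released implementation:

```python
import numpy as np

def align_tactile_to_frames(frame_ts, tactile_ts, tactile_force, offset=0.0):
    """Resample tactile readings onto camera frame timestamps.

    frame_ts      : (N,) camera frame timestamps in seconds
    tactile_ts    : (M,) tactile sample timestamps in seconds
    tactile_force : (M,) force readings at those timestamps
    offset        : estimated clock offset between the two streams
    """
    # Shift the tactile clock into the camera's time base, then linearly
    # interpolate each camera timestamp between its nearest tactile samples.
    return np.interp(frame_ts, tactile_ts + offset, tactile_force)

# Example: a 30 Hz camera against a 200 Hz tactile sensor over 2 s.
frame_ts = np.arange(0, 2, 1 / 30)
tactile_ts = np.arange(0, 2, 1 / 200)
tactile_force = np.sin(2 * np.pi * tactile_ts)  # placeholder signal
aligned = align_tactile_to_frames(frame_ts, tactile_ts, tactile_force)
print(aligned.shape)  # (60,)
```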
F/D/S taxonomy · boundary & mismatch samples
Brittle failure modes requiring precise force thresholding to avoid catastrophic damage.
Elastic/plastic behaviors requiring visual deformation-to-force mapping.
Dynamic mass distribution requiring weight/CoM estimation from visual cues.
Demonstrators deliberately approach or exceed unsafe force limits (e.g., crushing an egg). Tests safety-boundary prediction.
Cups filled at 0%, 25%, 50%, 75%, 100% capacity. Tests whether initial force scales with visual perception of weight.
Visual appearance conflicts with true mass (empty can). Tests force overshoot and human-like "size-weight illusion" reproduction.
Two targeted improvements that significantly enhance physical dynamics prediction in existing VLA architectures.
Classical mechanics relates force to deformation via Hooke's law. We explicitly couple gripper aperture (thumb–index distance) with force estimation to anchor object stiffness, providing the model with a continuous physical reference that pure visual features cannot supply.
The aperture signal acts as a stiffness anchor: the same visual object at different compression states maps to different forces, resolving ambiguity that is otherwise impossible from vision alone.
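As one way to make this concrete, a minimal PyTorch sketch of a force head conditioned on gripper aperture; the module, dimensions, and placement in the policy are hypothetical, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ApertureConditionedForceHead(nn.Module):
    """Predict contact force from visual features plus gripper aperture.

    The scalar thumb-index aperture is embedded and concatenated with the
    visual features, so the same visual object observed at different
    compression states maps to different force predictions.
    """

    def __init__(self, visual_dim=512, aperture_dim=32, hidden_dim=256):
        super().__init__()
        self.aperture_embed = nn.Sequential(nn.Linear(1, aperture_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(visual_dim + aperture_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar force estimate
        )

    def forward(self, visual_feat, aperture):
        # visual_feat: (B, visual_dim); aperture: (B, 1), e.g. in meters
        a = self.aperture_embed(aperture)
        return self.head(torch.cat([visual_feat, a], dim=-1))

head = ApertureConditionedForceHead()
force = head(torch.randn(4, 512), torch.rand(4, 1) * 0.08)
print(force.shape)  # torch.Size([4, 1])
```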
Standard MSE loss treats all force errors equally, failing to penalize dangerous overshoots that could shatter fragile objects. Our derivative-aware loss explicitly penalizes abrupt force transients and overshoots, producing safer, physically coherent predictions.
The overshoot term L_overshoot penalizes predictions that exceed the ground-truth peak force, while the derivative regularizer ‖ΔF′‖² encourages smooth, physically plausible force curves.
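A plausible PyTorch instantiation of this loss, built directly from the terms above; the weights `w_overshoot` and `w_deriv` are placeholder hyperparameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def physics_aware_loss(pred, target, w_overshoot=1.0, w_deriv=0.1):
    """Derivative-aware force loss over (B, T) force trajectories."""
    # Base reconstruction term.
    mse = F.mse_loss(pred, target)

    # L_overshoot: penalize only the portion of the prediction that
    # exceeds the ground-truth peak force of each trajectory.
    peak = target.max(dim=1, keepdim=True).values
    overshoot = F.relu(pred - peak).pow(2).mean()

    # ||ΔF'||²: match first differences (a discrete time derivative)
    # to discourage abrupt, physically implausible force transients.
    dpred = pred[:, 1:] - pred[:, :-1]
    dtarget = target[:, 1:] - target[:, :-1]
    deriv = (dpred - dtarget).pow(2).mean()

    return mse + w_overshoot * overshoot + w_deriv * deriv
```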
Three controlled probes across 8 VLA architectures, tested on physical consistency, safety boundaries, and visual prior reproduction.
Initial peak force scales proportionally with visual water level (Pearson r > 0.97 for Pi0 V+T), closely matching human demonstrators. Vision-only models show near-flat responses, confirming that tactile input is essential for force prior formation.
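The scaling claim reduces to a correlation between fill level and initial peak force. A sketch of that check with SciPy, using made-up forces rather than benchmark data:

```python
import numpy as np
from scipy.stats import pearsonr

fill_levels = np.array([0.0, 0.25, 0.50, 0.75, 1.0])
# Hypothetical initial peak forces (N) predicted at each fill level.
peak_forces = np.array([1.1, 1.9, 2.8, 3.6, 4.5])

r, p = pearsonr(fill_levels, peak_forces)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
# A near-flat response (r ≈ 0) would indicate no visual force prior.
```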
Visual appearance matches actual weight. No force overshoot. Model correctly scales force.
Model applies heavy-can prior. Dramatic initial overshoot followed by correction—mirroring human "size-weight illusion."
The force overshoot in visuo-tactile models mirrors the human "size-weight illusion," confirming that FEEL-trained models build genuine visual force priors rather than mere kinematic imitation. This misjudgment is a feature, not a bug: it shows that visual cues genuinely drive force prediction.
| Architecture | Modality | Force MAE ↓ | Weight MAE ↓ | RMSE ↓ | R² ↑ |
|---|---|---|---|---|---|
| ACT (CNN-BC) | Vision-Only | 1.85 | 42.3 | 2.10 | 0.45 |
| ACT (CNN-BC) | Visuo-Tactile | 0.92 | 18.5 | 1.15 | 0.78 |
| SmolVLA (VLM) | Vision-Only | 1.78 | 40.1 | 1.95 | 0.48 |
| SmolVLA (VLM) | Visuo-Tactile | 0.88 | 16.2 | 1.02 | 0.81 |
| Pi0 (Flow Matching) | Vision-Only | 1.65 | 38.5 | 1.88 | 0.52 |
| Pi0 (Flow Matching) | Visuo-Tactile | 0.65 | 12.4 | 0.85 | 0.89 |
| GR00T (Transformer) | Vision-Only | 1.70 | 39.2 | 1.90 | 0.50 |
| GR00T (Transformer) | Visuo-Tactile | 0.72 | 14.1 | 0.92 | 0.86 |
Pi0 (Flow Matching) + Visuo-Tactile achieves the lowest Force MAE (0.65) and highest R² (0.89), outperforming discrete token-based models. Across all architectures, Visuo-Tactile outperforms Vision-Only by 40–60% on force estimation.
| Architecture | Modality | Early Warning (s) ↑ | Violation Rate (%) ↓ | F1 (Unsafe) ↑ |
|---|---|---|---|---|
| ACT | Vision-Only | 0.12 | 78.5 | 0.34 |
| ACT | Visuo-Tactile | 0.65 | 22.1 | 0.76 |
| Pi0 | Vision-Only | 0.15 | 75.2 | 0.38 |
| Pi0 | Visuo-Tactile | 0.88 | 12.4 | 0.89 |
Vision-only models fail catastrophically on Fragile objects (violation rate > 75%). With tactile sensing, Pi0 achieves 0.88 s of early warning before failure, enough to prevent most physical damage, reducing violations by 83% (75.2% → 12.4%).
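For readers reproducing these numbers, a sketch of how the two safety metrics might be computed per trial; the thresholding convention is an assumption, not the benchmark's official definition:

```python
import numpy as np

def first_crossing(force, limit):
    """Index of the first sample at or above the unsafe limit, else None."""
    hits = np.flatnonzero(force >= limit)
    return int(hits[0]) if hits.size else None

def safety_metrics(pred_force, true_force, limit, dt=0.01):
    """Per-trial early-warning time and violation flag.

    pred_force, true_force: (T,) predicted / executed force, sampled every dt s.
    limit: unsafe force threshold for this object (e.g., egg crush force).
    """
    pred_idx = first_crossing(pred_force, limit)
    true_idx = first_crossing(true_force, limit)

    # Violation: the executed force actually crossed the unsafe limit.
    violated = true_idx is not None

    # Early warning: seconds between the model's first unsafe prediction
    # and the true crossing (positive = warned in advance).
    early_warning = None
    if violated and pred_idx is not None:
        early_warning = (true_idx - pred_idx) * dt
    return early_warning, violated
```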
What FEEL reveals about the physical intelligence landscape of current foundation models.
Flow Matching architectures (Pi0) outperform discrete token-based models (ACT) on all force-related metrics. The continuous generation process inherently preserves temporal coherence in force trajectories.
On visual-physical mismatch samples, vision-only models fail to distinguish the deception—their force predictions remain incorrect. Only models with tactile input can form and correct physical priors.
Models that produce force overshoot on the empty Coke can experiment are demonstrating genuine visual prior formation—the same "size-weight illusion" that humans experience. Models without this overshoot are simply not building physical priors.
Ablation studies confirm that thumb–index aperture as an auxiliary input provides a direct physical reference that resolves ambiguity in stiffness estimation, yielding substantial improvements across all object categories.
The first offline representation benchmark for probing physical force intuition in VLA models. Includes hardware blueprints (3D-printable STL + BOM), spatio-temporal synchronization algorithms, and an F/D/S taxonomy with boundary and mismatch samples.
Aperture-Conditioned Stiffness Anchoring and Physics-Aware Derivative Penalty loss. Together these reduce force MAE by ~30% and overshoot rate by ~40% across evaluated architectures.
Systematic evidence that generative policies (Flow Matching) outperform discrete token-based models on continuous force dynamics, and that the size-weight illusion reproduction is a diagnostic marker of genuine physical prior formation.
Full release of hardware blueprints, synchronization code, and dataset. Democratizes visuo-tactile data collection toward internet-scale embodied intelligence data pipelines.
```bibtex
@inproceedings{feel2026,
  title     = {FEEL: Force Intuition from Egocentric Experience Learning},
  author    = {Anonymous Authors},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2026},
  note      = {Under Review},
  url       = {https://github.com/306327680/FEEL-Unlocking-Force-Intuition-in-VLA-Models_Web}
}
```