An offline representation benchmark and reproducible visuo-tactile data engine for probing whether Vision-Language-Action (VLA) models develop human-like physical intuition.
Human manipulation relies on internal sensorimotor models that generate visual feedforward priors before contact. Current VLA models lack this capacity.
Cognitive science shows that humans form visual feedforward priors before contact: seeing a full cup triggers anticipatory scaling of grasp force, while a deceptively empty Coke can causes measurable force overshoot, direct evidence of visual prior formation.
Models like RT-2 and OpenVLA excel at semantic understanding and instruction following, yet they lack explicit physical grounding. They may plan the right grasp for an egg, but without force intuition the result is often catastrophic failure at safety boundaries.
"Can current VLA models learn this human-like physical intuition, and under what circumstances do they struggle?"
A reproducible visuo-tactile collection setup with hardware blueprints and spatio-temporal synchronization algorithms.
6-DoF wrist & finger poses via MPS · mm accuracy
High-res contact force distribution · deformation map
Aligns high-frequency visual/IMU streams with tactile readings (see the synchronization sketch below)
Open-source STL + BOM for low-cost sensor mount
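To make the synchronization step concrete, here is a minimal sketch of the core alignment operation: resampling a high-rate tactile stream onto camera frame timestamps by linear interpolation after a clock-offset correction. The function, rates, and offset handling are illustrative assumptions, not the released implementation:

```python
import numpy as np

def align_tactile_to_frames(frame_ts, tactile_ts, tactile_force, offset=0.0):
    """Resample tactile readings onto camera frame timestamps.

    frame_ts      : (N,) camera frame timestamps in seconds
    tactile_ts    : (M,) tactile sample timestamps in seconds
    tactile_force : (M,) force readings at those timestamps
    offset        : estimated clock offset between the two streams
    """
    # Shift the tactile clock into the camera's time base, then linearly
    # interpolate each camera timestamp between its nearest tactile samples.
    return np.interp(frame_ts, tactile_ts + offset, tactile_force)

# Example: a 30 Hz camera against a 200 Hz tactile sensor over 2 s.
frame_ts = np.arange(0, 2, 1 / 30)
tactile_ts = np.arange(0, 2, 1 / 200)
tactile_force = np.sin(2 * np.pi * tactile_ts)  # placeholder signal
aligned = align_tactile_to_frames(frame_ts, tactile_ts, tactile_force)
print(aligned.shape)  # (60,)
```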
F/D/S taxonomy · boundary & mismatch samples
Brittle failure modes requiring precise force thresholding to avoid catastrophic damage.
Elastic/plastic behaviors requiring visual deformation-to-force mapping.
Dynamic mass distribution requiring weight/CoM estimation from visual cues.
Demonstrators deliberately approach or exceed unsafe force limits (e.g., crushing an egg). Tests safety-boundary prediction.
Cups filled at 0%, 25%, 50%, 75%, 100% capacity. Tests whether initial force scales with visual perception of weight.
Visual appearance conflicts with true mass (empty can). Tests force overshoot and human-like "size-weight illusion" reproduction.
Two targeted improvements that significantly enhance physical dynamics prediction in existing VLA architectures.
Classical mechanics relates force to deformation via Hooke's law. We explicitly couple gripper aperture (thumb–index distance) with force estimation to anchor object stiffness, providing the model with a continuous physical reference that pure visual features cannot supply.
The aperture signal acts as a stiffness anchor: the same visual object at different compression states maps to different forces, resolving ambiguity that is otherwise impossible from vision alone.
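As one way to make this concrete, a minimal PyTorch sketch of a force head conditioned on gripper aperture; the module, dimensions, and placement in the policy are hypothetical, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ApertureConditionedForceHead(nn.Module):
    """Predict contact force from visual features plus gripper aperture.

    The scalar thumb-index aperture is embedded and concatenated with the
    visual features, so the same visual object observed at different
    compression states maps to different force predictions.
    """

    def __init__(self, visual_dim=512, aperture_dim=32, hidden_dim=256):
        super().__init__()
        self.aperture_embed = nn.Sequential(nn.Linear(1, aperture_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(visual_dim + aperture_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar force estimate
        )

    def forward(self, visual_feat, aperture):
        # visual_feat: (B, visual_dim); aperture: (B, 1), e.g. in meters
        a = self.aperture_embed(aperture)
        return self.head(torch.cat([visual_feat, a], dim=-1))

head = ApertureConditionedForceHead()
force = head(torch.randn(4, 512), torch.rand(4, 1) * 0.08)
print(force.shape)  # torch.Size([4, 1])
```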
Standard MSE loss treats all force errors equally, failing to penalize dangerous overshoots that could shatter fragile objects. Our derivative-aware loss explicitly penalizes abrupt force transients and overshoots, producing safer, physically coherent predictions.
The overshoot term L_overshoot penalizes predictions that exceed the ground-truth peak force, while the derivative regularizer ‖ΔF′‖² encourages smooth, physically plausible force curves.
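A plausible PyTorch instantiation of this loss, built directly from the terms above; the weights `w_overshoot` and `w_deriv` are placeholder hyperparameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def physics_aware_loss(pred, target, w_overshoot=1.0, w_deriv=0.1):
    """Derivative-aware force loss over (B, T) force trajectories."""
    # Base reconstruction term.
    mse = F.mse_loss(pred, target)

    # L_overshoot: penalize only the portion of the prediction that
    # exceeds the ground-truth peak force of each trajectory.
    peak = target.max(dim=1, keepdim=True).values
    overshoot = F.relu(pred - peak).pow(2).mean()

    # ||ΔF'||²: match first differences (a discrete time derivative)
    # to discourage abrupt, physically implausible force transients.
    dpred = pred[:, 1:] - pred[:, :-1]
    dtarget = target[:, 1:] - target[:, :-1]
    deriv = (dpred - dtarget).pow(2).mean()

    return mse + w_overshoot * overshoot + w_deriv * deriv
```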
Three controlled probes across 8 VLA architectures, tested on physical consistency, safety boundaries, and visual prior reproduction.
Initial peak force scales proportionally with visual water level (Pearson r > 0.97 for Pi0 V+T), closely matching human demonstrators. Vision-only models show near-flat responses, confirming that tactile input is essential for force prior formation.
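The scaling claim reduces to a correlation between fill level and initial peak force. A sketch of that check with SciPy, using made-up forces rather than benchmark data:

```python
import numpy as np
from scipy.stats import pearsonr

fill_levels = np.array([0.0, 0.25, 0.50, 0.75, 1.0])
# Hypothetical initial peak forces (N) predicted at each fill level.
peak_forces = np.array([1.1, 1.9, 2.8, 3.6, 4.5])

r, p = pearsonr(fill_levels, peak_forces)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
# A near-flat response (r ≈ 0) would indicate no visual force prior.
```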
Visual appearance matches actual weight. No force overshoot. Model correctly scales force.
Model applies heavy-can prior. Dramatic initial overshoot followed by correction—mirroring human "size-weight illusion."
The force overshoot in visuo-tactile models mirrors the human "size-weight illusion," confirming that FEEL-trained models build genuine visual force priors rather than mere kinematic imitation. This misjudgment is a feature, not a bug: it shows that visual cues genuinely drive force prediction.
| Architecture | Modality | Force MAE ↓ | Weight MAE ↓ | RMSE ↓ | R² ↑ |
|---|---|---|---|---|---|
| ACT (CNN-BC) | Vision-Only | 1.85 | 42.3 | 2.10 | 0.45 |
| ACT (CNN-BC) | Visuo-Tactile | 0.92 | 18.5 | 1.15 | 0.78 |
| SmolVLA (VLM) | Vision-Only | 1.78 | 40.1 | 1.95 | 0.48 |
| SmolVLA (VLM) | Visuo-Tactile | 0.88 | 16.2 | 1.02 | 0.81 |
| Pi0 (Flow Matching) | Vision-Only | 1.65 | 38.5 | 1.88 | 0.52 |
| Pi0 (Flow Matching) | Visuo-Tactile | 0.65 | 12.4 | 0.85 | 0.89 |
| GR00T (Transformer) | Vision-Only | 1.70 | 39.2 | 1.90 | 0.50 |
| GR00T (Transformer) | Visuo-Tactile | 0.72 | 14.1 | 0.92 | 0.86 |
Pi0 (Flow Matching) + Visuo-Tactile achieves the lowest Force MAE (0.65) and highest R² (0.89), outperforming discrete token-based models. Across all architectures, Visuo-Tactile outperforms Vision-Only by 40–60% on force estimation.
| Architecture | Modality | Early Warning (s) ↑ | Violation Rate (%) ↓ | F1 (Unsafe) ↑ |
|---|---|---|---|---|
| ACT | Vision-Only | 0.12 | 78.5 | 0.34 |
| ACT | Visuo-Tactile | 0.65 | 22.1 | 0.76 |
| Pi0 | Vision-Only | 0.15 | 75.2 | 0.38 |
| Pi0 | Visuo-Tactile | 0.88 | 12.4 | 0.89 |
Vision-only models fail catastrophically on Fragile objects (violation rate > 75%). With tactile sensing, Pi0 achieves 0.88 s of early warning before failure, enough to prevent most physical damage, reducing violations by 83% (75.2% → 12.4%).
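For readers reproducing these numbers, a sketch of how the two safety metrics might be computed per trial; the thresholding convention is an assumption, not the benchmark's official definition:

```python
import numpy as np

def first_crossing(force, limit):
    """Index of the first sample at or above the unsafe limit, else None."""
    hits = np.flatnonzero(force >= limit)
    return int(hits[0]) if hits.size else None

def safety_metrics(pred_force, true_force, limit, dt=0.01):
    """Per-trial early-warning time and violation flag.

    pred_force, true_force: (T,) predicted / executed force, sampled every dt s.
    limit: unsafe force threshold for this object (e.g., egg crush force).
    """
    pred_idx = first_crossing(pred_force, limit)
    true_idx = first_crossing(true_force, limit)

    # Violation: the executed force actually crossed the unsafe limit.
    violated = true_idx is not None

    # Early warning: seconds between the model's first unsafe prediction
    # and the true crossing (positive = warned in advance).
    early_warning = None
    if violated and pred_idx is not None:
        early_warning = (true_idx - pred_idx) * dt
    return early_warning, violated
```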
What FEEL reveals about the physical intelligence landscape of current foundation models.
Flow Matching architectures (Pi0) outperform discrete token-based models (ACT) on all force-related metrics. The continuous generation process inherently preserves temporal coherence in force trajectories.
On visual-physical mismatch samples, vision-only models fail to distinguish the deception—their force predictions remain incorrect. Only models with tactile input can form and correct physical priors.
Models that produce force overshoot on the empty Coke can experiment are demonstrating genuine visual prior formation—the same "size-weight illusion" that humans experience. Models without this overshoot are simply not building physical priors.
Ablation studies confirm that thumb–index aperture as an auxiliary input provides a direct physical reference that resolves ambiguity in stiffness estimation, yielding substantial improvements across all object categories.
The first offline representation benchmark for probing physical force intuition in VLA models. Includes hardware blueprints (3D-printable STL + BOM), spatio-temporal synchronization algorithms, and an F/D/S taxonomy with boundary and mismatch samples.
Aperture-Conditioned Stiffness Anchoring and Physics-Aware Derivative Penalty loss. Together these reduce force MAE by ~30% and overshoot rate by ~40% across evaluated architectures.
Systematic evidence that generative policies (Flow Matching) outperform discrete token-based models on continuous force dynamics, and that the size-weight illusion reproduction is a diagnostic marker of genuine physical prior formation.
Full release of hardware blueprints, synchronization code, and dataset. Democratizes visuo-tactile data collection toward internet-scale embodied intelligence data pipelines.
```bibtex
@inproceedings{feel2026,
  title     = {FEEL: Force Intuition from Egocentric Experience Learning},
  author    = {Anonymous Authors},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2026},
  note      = {Under Review},
  url       = {https://github.com/306327680/FEEL-Unlocking-Force-Intuition-in-VLA-Models_Web}
}
```