FEEL: Force Intuition from Egocentric Experience Learning

Abstract

Overview

When humans interact with familiar objects, they rely on visual cues to anticipate physical properties—such as weight, stiffness, and deformability—and adjust their force strategies before contact. However, while current Vision-Language-Action (VLA) models excel at semantic understanding and action generation, it remains an open question whether their representations can transfer to predicting continuous contact forces.

To investigate this, we introduce FEEL (Force Intuition from Egocentric Experience Learning), a controlled egocentric force-intuition probing benchmark. FEEL pairs first-person RGB video and precise 6-DoF hand poses with synchronized tactile signals. Crucially, it introduces boundary samples and visual-physical mismatch samples to probe force estimation when visual priors are unreliable. In total, the benchmark comprises 3.6 hours of synchronized vision-touch-pose data covering 7 object categories within the Fragile, Deformable, and Variable-Load taxonomy.

We systematically evaluate seven representative VLA baselines (ACT, WALL-X, Pi0, Pi0.5, Pi0-FAST, X-VLA, SmolVLA) and our own model family. Massive embodied backbones often yield low or negative R² on deformable objects (e.g., X-VLA R²=0.108 overall, Pi0-FAST R²=−4.574), performing worse than a lightweight ACT baseline. We show that combining three physical inductive biases—finger-distance-conditioned state encoding, GRU temporal modeling, and an uncertainty-aware NLL head with auxiliary contact loss—and routing them through an Adaptive Mixture-of-Experts head (EgoForce) yields the strongest model overall: R²=0.703 and MAE=4.550 N at ~37M parameters, outperforming every billion-parameter foundation model on every material category.

Motivation

The Physical Intuition Gap

Human manipulation relies on internal sensorimotor models that generate visual feedforward priors before contact. Current VLA models lack this capacity.

🧠

Human Internal Model

Feedforward + Feedback

Cognitive science shows that humans employ visual feedforward priors before contact. Seeing a full cup triggers anticipatory scaling of grasp force. A deceptive empty milk carton causes measurable force overshoot—proof of visual prior formation.

Feedforward: Initial force ∝ visual water level
Feedback: Tactile correction after contact transient

🤖

Current VLA Models

Vision-Only · No Physical Prior

Models like RT-2 and OpenVLA excel at semantic understanding and instruction following, yet they lack explicit physical grounding. They may select and reach for the right object, such as a potato chip, but without force intuition the grasp itself often fails catastrophically at safety boundaries.

Problem 1: No force scaling from visual cues
Problem 2: Brittle under visual-physical mismatch

Core Scientific Question

"Can current VLA models learn this human-like physical intuition, and under what circumstances do they struggle?"

Benchmark

The FEEL Benchmark & Data Engine

A reproducible visuo-tactile collection setup with hardware blueprints and spatio-temporal synchronization algorithms.

Data Collection Pipeline

👓

Meta Aria Glasses

6-DoF wrist & finger poses via MPS · mm accuracy

✋

Paxini PX6AX Tactile Sensor

PX6AX-GEN3-DP-S2015-Elite on thumb & index · high-res fingertip force / deformation

⏱️

Spatio-Temporal Sync

Aligns high-freq visual/IMU streams with tactile readings

📐

3D-Print Blueprint

Open-source STL + BOM for low-cost sensor mount

📦

FEEL Dataset

F/D/V taxonomy · boundary & mismatch samples
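The Spatio-Temporal Sync step above is described only at a high level. One common way to align a lower-rate video stream with a higher-rate tactile stream is nearest-timestamp matching with a tolerance, and the sketch below illustrates that idea in plain Python. The function name `align_nearest`, the `max_gap` tolerance, and the assumption that both streams already share one clock are illustrative choices, not details of the FEEL pipeline.

```python
from bisect import bisect_left


def align_nearest(frame_ts, tactile_ts, tactile_vals, max_gap=0.02):
    """For each frame timestamp, pick the tactile sample closest in time.

    Timestamps are seconds on a shared clock (assumption) and tactile_ts
    must be sorted ascending. Frames with no tactile sample within
    max_gap seconds are mapped to None.
    """
    aligned = []
    for t in frame_ts:
        i = bisect_left(tactile_ts, t)
        # Candidates: the sample just before t and the sample at/after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(tactile_ts)]
        if not candidates:
            aligned.append(None)
            continue
        best = min(candidates, key=lambda j: abs(tactile_ts[j] - t))
        gap = abs(tactile_ts[best] - t)
        aligned.append(tactile_vals[best] if gap <= max_gap else None)
    return aligned


# Toy timestamps, invented for illustration only.
frames = [0.0, 0.1, 0.5]
tactile_times = [0.0, 0.04, 0.095, 0.13, 0.2]
tactile_force = [1, 2, 3, 4, 5]
print(align_nearest(frames, tactile_times, tactile_force))  # [1, 3, None]
```

Returning `None` for frames without a close tactile sample keeps gaps explicit, so a later stage can drop or interpolate them deliberately.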

F/D/V Physical Taxonomy

Fragile

Brittle failure modes requiring precise force thresholding to avoid catastrophic damage.

Potato chips · Crackers

Deformable

Elastic/plastic behaviors requiring visual deformation-to-force mapping.

Sponge · Bread · Bottles

Variable-Load

Objects whose effective weight varies with content or fill level, requiring force estimation under a changing, visually-inferred load.

Water cups · Rice bags · Milk cartons

Three Controlled Probe Types

⚠️

Boundary Samples

Demonstrators deliberately approach or exceed unsafe force limits (e.g., crushing a potato chip). Tests safety-boundary prediction.

🌊

Water-Level Prior Samples

Cups filled at 0%, 25%, 50%, 75%, 100% capacity. Tests whether initial force scales with visual perception of weight.

🎭

Visual-Physical Mismatch

Visual appearance conflicts with true mass (e.g., an empty can that looks full). Tests force overshoot and human-like "size-weight illusion" reproduction.
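The probes above suggest two simple diagnostics: how much the initial force exceeds the settled force (overshoot), and how initial force scales with fill level. The sketch below shows one possible way to compute both. The window lengths, the function names, and the toy numbers are assumptions made for illustration; the page does not specify how FEEL measures either quantity.

```python
def initial_overshoot(force, onset_len, settle_len):
    """Peak force in the first onset_len samples minus the mean force
    over the last settle_len samples (both window sizes are assumptions)."""
    peak = max(force[:onset_len])
    steady = sum(force[-settle_len:]) / settle_len
    return peak - steady


def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line y ~ a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy force trace (N) for a mismatch trial, invented for illustration.
trace = [2.0, 6.0, 4.0, 3.0, 3.0]
print(initial_overshoot(trace, onset_len=3, settle_len=2))  # 3.0

# Toy initial forces (N) at the five fill levels (%), invented for illustration.
fill = [0, 25, 50, 75, 100]
initial_force = [1, 2, 3, 4, 5]
print(least_squares_slope(fill, initial_force))  # newtons per percent fill
```

A positive slope on the water-level probe and a positive overshoot on the mismatch probe would both be consistent with a visually driven force prior.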

Method

Technical Contributions for Force Learning

Three targeted contributions that together improve continuous contact force prediction in egocentric VLA settings.

NLL Uncertainty-Aware Prediction Head

Standard deterministic regression treats all force errors equally, failing to account for label noise and the inherent variability of boundary samples. We replace the regression head with an NLL formulation, making the model simultaneously predict force mean and uncertainty. This explicitly dampens the impact of noisy near-failure labels and produces calibrated predictions.

L_NLL = (F − μ)² / (2σ²) + log σ

High σ on boundary samples means the model correctly signals "I am uncertain here"—actionable for downstream safety planning. EgoForce w/o MoE (NLL+GRU+Contact) reaches Sponge R²=0.656 and the lowest 0–5 N MAE (1.66 N) overall.
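The loss above is the Gaussian negative log-likelihood with the constant term dropped. The sketch below evaluates it for a single sample and shows why a larger predicted σ dampens the pull of a noisy label. Parameterizing the head by log σ is an assumption made here for numerical stability (it keeps σ positive); the page does not say how the head is parameterized, and a real training setup would use a tensor library rather than plain Python.

```python
import math


def gaussian_nll(force, mu, log_sigma):
    """L_NLL = (F - mu)^2 / (2 sigma^2) + log sigma for one sample."""
    sigma = math.exp(log_sigma)
    return (force - mu) ** 2 / (2.0 * sigma ** 2) + log_sigma


def nll_grad_mu(force, mu, log_sigma):
    """Derivative of L_NLL with respect to mu: -(F - mu) / sigma^2."""
    sigma = math.exp(log_sigma)
    return -(force - mu) / sigma ** 2


# Same 2 N error, three different predicted uncertainties.
for sigma in (1.0, 2.0, 4.0):
    loss = gaussian_nll(3.0, 1.0, math.log(sigma))
    grad = nll_grad_mu(3.0, 1.0, math.log(sigma))
    print(f"sigma={sigma}: loss={loss:.3f}, dL/dmu={grad:.3f}")
```

For a fixed error of 2 N the loss is about 2.00 at σ=1, 1.19 at σ=2, and 1.51 at σ=4: it is lowest when σ matches the error magnitude, and the gradient on μ shrinks as 1/σ², which is the dampening effect described above.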

Finger-Distance-Conditioned Contact State Encoding

RGB images alone cannot distinguish approach, first contact, or firm grasp. We inject the thumb–index distance as an additional input feature whose trajectory continuously encodes contact phase—information invisible to the camera.

d_ti = ‖p_thumb − p_index‖₂ ; z_ACS = [s_6-DoF ⊕ d_ti]

Ablation: removing this input raises Jelly MAE from 5.115 to 5.854 (+14%) and drops R² by 0.172, indicating that the distance carries contact-phase context that pure visual features cannot recover.
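The state construction in the formula above amounts to a Euclidean distance followed by a concatenation. A minimal sketch, assuming fingertip positions are 3D points in a common frame and the 6-DoF state is a flat sequence of six numbers (the exact pose representation and units are not given on this page):

```python
import math


def thumb_index_distance(p_thumb, p_index):
    """d_ti = ||p_thumb - p_index||_2 for two 3D fingertip positions."""
    return math.dist(p_thumb, p_index)


def contact_state(pose_6dof, p_thumb, p_index):
    """z_ACS = [s_6-DoF (+) d_ti]: the 6-DoF state with d_ti appended."""
    return list(pose_6dof) + [thumb_index_distance(p_thumb, p_index)]


# Toy values, invented for illustration (units assumed to be meters/radians).
pose = [0.1, 0.2, 0.3, 0.0, 0.0, 1.57]
z = contact_state(pose, (0.0, 0.0, 0.0), (0.03, 0.04, 0.0))
print(len(z), round(z[-1], 3))  # 7 0.05
```

As the fingers close on an object, the last entry of `z` shrinks and then plateaus, which is the contact-phase trajectory the text refers to.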

AdaMoE: Cross-Scenario Expert Routing

Different materials (Fragile, Deformable, Variable-Load) exhibit heterogeneous force patterns that a single model struggles to capture. AdaMoE uses a scene-adaptive router to dynamically combine a shared prediction head with multiple expert heads (top-k selection). A Scale Adapter decouples expert selection from contribution weighting, avoiding the load-balance–accuracy trade-off of vanilla MoE.

ŷ = w_shared·h_shared + Σ_k w_k·h_expert-k

Router input: object one-hot + GRU hidden state. Result: AdaMoE improves Sponge R² but shows cross-material trade-offs on Jelly, revealing that material heterogeneity requires adaptive modeling.
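The combination rule above can be sketched for scalar head outputs as follows. This is only an illustration of the idea of separating selection from weighting: here `router_logits` choose which experts fire, while a second, hypothetical set of `scale_logits` sets how much each chosen expert contributes. The softmax over selected experts, the fixed `w_shared`, and all names are assumptions; the page does not give the exact form of the Scale Adapter.

```python
import math


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def adamoe_combine(h_shared, h_experts, router_logits, scale_logits,
                   w_shared=1.0, k=2):
    """y = w_shared * h_shared + sum_k w_k * h_expert_k over top-k experts.

    Selection uses router_logits; contribution weights come from a softmax
    of scale_logits restricted to the selected experts (assumed form).
    """
    chosen = top_k_indices(router_logits, k)
    peak = max(scale_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(scale_logits[i] - peak) for i in chosen]
    total = sum(exps)
    weights = {i: e / total for i, e in zip(chosen, exps)}
    y = w_shared * h_shared + sum(weights[i] * h_experts[i] for i in chosen)
    return y, weights


# Toy head outputs (N) and logits, invented for illustration.
y, w = adamoe_combine(h_shared=1.0, h_experts=[10.0, 20.0, 30.0],
                      router_logits=[0.1, 2.0, 1.0],
                      scale_logits=[0.0, 0.0, 0.0])
print(y, w)  # 26.0 {1: 0.5, 2: 0.5}
```

Because the weights do not depend on the routing scores, a load-balancing term applied to `router_logits` would not directly distort how much each selected expert contributes.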

Experiments

Offline Evaluation Results

Real results on Jelly & Sponge materials (3 runs ±σ). Strong VLA backbones do not automatically outperform targeted force models on contact force estimation.

Method	MAE (N) ↓	RMSE (N) ↓	R² ↑	Pearson r ↑	Contact F1 ↑	Speed
Ours (NLL + Temporal)
EgoForce ⭐	4.550	8.272	0.703	0.858	0.903	~1.67 s/step
Baselines
ACT	5.318	9.960	0.569	0.830	0.939	~1.2 s/step
WALL-X	5.420	9.455	0.611	0.803	0.867	~1.22 s/step
Large VLA Models
SmolVLA	7.627	12.440	0.327	0.715	0.862	~2.4 s/step
Pi0	6.243	11.069	0.468	0.707	0.865	~4.3 s/step
Pi0.5	6.451	11.245	0.451	0.698	0.879	~5.4 s/step
X-VLA	8.707	14.339	0.108	0.585	0.857	~2.1 s/step
Pi0-FAST	16.088	32.543	−4.574	0.015	0.843	~10.8 s/step

Strong VLA backbones fail on continuous force estimation

Pi0-FAST collapses to overall R²=−4.574 and X-VLA to R²=0.108—both billion-parameter VLAs perform worse than the lightweight ACT baseline (R²=0.569) on continuous force curves. FEEL's controlled protocol isolates this failure mode that standard action benchmarks completely miss. Our EgoForce achieves overall R²=0.703.
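A negative R² means the predictions have a larger squared error than always predicting the mean force, and a model can have a high Pearson r while still scoring a poor R² if its output is offset or mis-scaled. The sketch below computes the regression metrics from the table using their standard definitions; it is not the benchmark's evaluation script, and Contact F1 is omitted because the contact threshold is not stated on this page.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE, R^2 and Pearson r using the standard definitions."""
    n = len(y_true)
    err = [p - t for t, p in zip(y_true, y_pred)]
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum(e * e for e in err)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    denom = math.sqrt(ss_tot * var_p)
    return {
        "mae": sum(abs(e) for e in err) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,
        # Pearson r is undefined when either series is constant.
        "pearson": cov / denom if denom > 0 else float("nan"),
    }


truth = [0.0, 2.0, 4.0]  # toy forces (N), invented for illustration
print(regression_metrics(truth, [2.0, 2.0, 2.0])["r2"])   # 0.0, mean predictor
offset = regression_metrics(truth, [10.0, 12.0, 14.0])    # right shape, wrong level
print(offset["pearson"], offset["r2"])                    # 1.0 -36.5
```

The second case shows why the table reports both columns: correlation rewards tracking the shape of the force curve, while R² also penalizes errors in absolute level.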

✅ Normal Case — Full Can

Visual appearance matches actual weight. No force overshoot. Model correctly scales force.

Correct Prior · No Overshoot

⚡ Mismatch — Empty Can (Deceptive)

Model applies heavy-can prior. Dramatic initial overshoot followed by correction—mirroring human "size-weight illusion."

Visual Overshoot · Human-Like Misjudgment

✓

Models reproduce human-like force overshoot under visual deception

The force overshoot in visuo-tactile models mirrors the human "size-weight illusion," indicating that models trained on FEEL build genuine visual force priors rather than merely imitating kinematics. This misjudgment is a feature, not a bug: it shows that visual cues influence force prediction.

Visual-mismatch overshoot proves prior formation, not kinematic imitation

A model that overshoots on the empty-can probe has built a genuine visual force prior—the same "size-weight illusion" humans experience. Models that don't overshoot are simply not forming physical priors, regardless of their semantic task performance. FEEL uses this as a diagnostic signal.

Two paper-aligned analyses unpack what each inductive bias contributes (single seed = 1000). The first is an ablation across our three GRU-based prediction heads (paper Table 2); the second reports per-model MAE stratified by ground-truth force magnitude (paper Table 3).

GRU Head Ablation (Paper Table 2)

Variant	Overall R² ↑	Overall MAE (N) ↓	Sponge R² ↑	Empty Bottle R² ↑	Full Bottle R² ↑	0–5 N MAE (N) ↓
EgoForce w/o MoE & NLL & Contact (L1 head)	0.677	4.916	0.123	0.641	0.388	3.47
EgoForce w/o MoE & Contact	0.630	4.986	0.623	0.623	0.277	2.84
EgoForce w/o MoE (NLL+GRU+Contact)	0.629	4.831	0.656	0.368	0.284	1.66

All three variants share the ResNet-18 tactile encoder, ACS state encoder, and GRU temporal backbone—differing only in the prediction head. Plain L1 wins overall R² and the high-force Empty / Full Bottle categories; NLL+Contact wins the low-force regimes (Sponge R²=0.656, 0–5 N MAE=1.66 N) thanks to heteroscedastic uncertainty + auxiliary contact loss. No single head wins everywhere — motivating the material-aware AdaMoE router (overall R²=0.703).

Force-Range-Stratified MAE (Paper Table 3)

Model	0–5 N MAE ↓	5–20 N MAE ↓	20–50 N MAE ↓
EgoForce w/o MoE	1.66	3.81	13.42
EgoForce ⭐	2.66	3.11	10.95
EgoForce w/o MoE & Contact	2.84	3.62	12.77
ACT	3.22	3.86	13.73
EgoForce w/o MoE & NLL & Contact	3.47	3.47	11.05
WALL-X	4.44	3.59	11.69
Pi0	4.38	4.09	13.08
Pi0.5	4.75	4.31	13.34
SmolVLA	7.59	5.37	14.49
X-VLA	9.42	5.80	13.33
Pi0-FAST	12.76	13.29	22.19

The large VLA baselines degrade most in the low-force regime: SmolVLA and X-VLA incur 7.59 N and 9.42 N MAE on 0–5 N samples, errors larger than the targets themselves, while their 13–15 N error on the much larger 20–50 N targets is comparable to that of ACT (13.73 N). This pattern suggests that they bias toward the population-mean force rather than tracking contact-onset transients.
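The stratification in the table above groups samples by ground-truth force and averages the absolute error within each group. A minimal sketch, assuming half-open bins [low, high) in newtons; how the paper treats samples exactly on a bin edge or above 50 N is not stated here:

```python
def stratified_mae(y_true, y_pred,
                   bins=((0.0, 5.0), (5.0, 20.0), (20.0, 50.0))):
    """MAE per ground-truth force bin; None for bins with no samples."""
    result = {}
    for low, high in bins:
        errs = [abs(p - t) for t, p in zip(y_true, y_pred) if low <= t < high]
        result[(low, high)] = sum(errs) / len(errs) if errs else None
    return result


# Toy forces (N), invented for illustration: a predictor that always
# outputs the population mean (11 N) regardless of the input.
truth = [1.0, 3.0, 10.0, 30.0]
mean_pred = [11.0] * len(truth)
print(stratified_mae(truth, mean_pred))
# {(0.0, 5.0): 9.0, (5.0, 20.0): 1.0, (20.0, 50.0): 19.0}
```

In this toy case the mean predictor's 0–5 N error (9 N) is several times the size of the targets it is trying to match, which is the signature the paragraph above attributes to the large VLA baselines.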

✓

Different inductive biases win different force regimes

Within the GRU family, NLL+Contact decisively wins the low-force regime (0–5 N MAE=1.66 N) where heteroscedastic uncertainty pays off, while plain L1 takes over in the high-force Empty / Full Bottle categories where mean prediction matters more. AdaMoE routes per-step to a shared head plus top-2 specialised experts, lifting overall R² from 0.629 to 0.703 and winning Jelly, Sponge, and Full Bottle simultaneously.

Qualitative comparison (video grid): for each material, all 9 models predict on the same source clip, with the predicted force curve overlaid against the ground truth.

Materials: Fragile · Deformable · Rigid · Variable-Load

Models: EgoForce (ours), WALL-X, ACT, Pi0, Pi0.5, SmolVLA, X-VLA, GR00T, OpenTouch.

Core Insight from Real Data

R²=0.703

EgoForce (Ours) — best overall R²

R² < 0

Pi0-FAST — large VLA fails on force

R²=0.569

ACT — lightweight baseline beats massive VLAs

Key Findings

What FEEL Reveals

Four empirically grounded findings that distinguish FEEL from conventional action benchmarks.

Strong VLA backbones ≠ better force estimation

Massive VLA backbones produce low or negative overall R² on continuous force (X-VLA R²=0.108, SmolVLA 0.327, Pi0-FAST −4.574), often worse than the lightweight ACT baseline (R²=0.569). Their action-generation strengths do not transfer to continuous contact force prediction. FEEL's controlled protocol exposes this failure mode that standard task-success metrics completely hide.

Both visual input and hand state are necessary signals

Within our GRU family, removing the proprioceptive state cuts Jelly R² roughly in half, with thumb–index finger-distance d_ti as the dominant recoverable component (paper §5.2). Egocentric vision provides material/deformation cues; finger-distance+pose provides contact geometry. Models that ignore either modality cannot predict realistic force curves.

Temporal modeling + uncertainty awareness outperforms bigger backbones

EgoForce (~37M params) outperforms every larger foundation-model baseline, including Pi0 (578M), WALL-X (4.2B), and Pi0-FAST (3.3B). The bottleneck for contact-force learning is not model scale but the right inductive biases: explicit temporal context (GRU), calibrated uncertainty (NLL head with auxiliary contact loss), and material-aware Mixture-of-Experts routing.

Material heterogeneity requires adaptive modeling; single models trade off

EgoForce's AdaMoE router lifts overall R² from 0.629 (no-MoE baseline) to 0.703 and is the only configuration simultaneously best on Jelly, Sponge, and Full Bottle. Routing analysis reveals partial expert specialisation along low-force (Sponge / Empty Bottle) versus high-force (Full Bottle) regimes, with Jelly shared—evidence that FEEL's F/D/V taxonomy captures physically meaningful variation.

Summary

Contributions

🔬

FEEL: Controlled Force-Intuition Probing Benchmark

A benchmark specifically designed for testing whether VLA models develop human-like visual force priors, rather than a general egocentric tactile corpus. OpenTouch proved wearable touch collection is feasible; FEEL advances this with controlled counterfactual protocols (water-level, boundary, visual-mismatch), enabling cognitive-style probing of physical intuition in foundation models.

🎯

NLL Uncertainty Head + AdaMoE Expert Routing

Two architectural contributions targeting the specific challenges of contact force learning: (1) NLL prediction head for uncertainty-aware regression under label noise and boundary variability; (2) AdaMoE scene-adaptive routing with Scale Adapter, which decouples expert selection from contribution weighting and reveals material heterogeneity as a first-class modeling challenge.

📐

Finger-Distance-Conditioned Contact State Encoding

Thumb–index distance is injected as an input feature that encodes contact phase (approach, onset, stable grasp), which is invisible to RGB alone. Ablation: removing this input raises Jelly MAE by 14% and drops R² by 0.172.

💡

Architectural Insight: Scale ≠ Physical Intelligence

Systematic evidence that state-of-the-art generative VLAs (Pi0, Pi0.5, X-VLA) fail on continuous contact force estimation despite excelling at semantic tasks. The bottleneck is not model scale but modeling inductive bias—temporal context, uncertainty calibration, and scene-adaptive routing. FEEL provides the controlled benchmark to measure this gap.
