Late-May VLA / WAM / World Model Reading List Vol. 1

This is a compact reading directory for ten papers I would prioritize from the late-May VLA / WAM / World Model batch. The common thread is clear: the next step is less about making the action head larger and more about surrounding the policy with memory, scene state, world-model rollouts, and runtime verification.

Reading List

Pre-VLA (2605.22446)
VLA / World Model / Runtime Verification
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
The useful shift is from asking whether a VLA can generate an action to asking whether the action can be checked before execution.
SOMA (2605.22283)
VLA / Spatial Memory / Manipulation
Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action
The interesting piece is persistent spatial memory, especially for manipulation targets outside the current field of view.
EvoScene-VLA (2605.21862)
VLA / Scene Belief / Chunked Robot Control
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
Scene belief is moved into the action decoder, which matters because the early actions in a chunk change the scene that the later actions must reason about.
GaussianDream (2605.20752)
World Model / 3D Gaussian / Robotic Manipulation
GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation
A clean plug-in approach to 3D world modeling: 3D Gaussian structure provides spatial and future-state supervision without paying heavy rollout cost at inference.
VLA-REPLICA (2605.20774)
VLA / Real-World Benchmark / Evaluation
VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models
Low-cost real-world evaluation is important because this field has more demos than reproducible hardware benchmarks.
WorldArena 2.0 (2605.17912)
Embodied World Model / Benchmark / Interactive RL
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform
Embodied world-model evaluation is expanding beyond vision-only prediction toward modality coverage, platform diversity, and interactive RL utility.
Demo-JEPA (2605.20811)
World Model / JEPA / Cross-Embodiment Imitation
Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation
Imitation is reframed as predicting the future latent state a demonstration is trying to reach, which is a better fit for cross-embodiment transfer.
SWEET (2605.19319)
World Model / Image Editing / Sparse Visual Planning
SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
Sparse keyframe prediction via image editing may be more practical than full video rollout for many manipulation settings.
PhyWorld (2605.19242)
World Model / Video Generation / Physical AI
PhyWorld: Physics-Faithful World Model for Video Generation
If video models are going to act as Physical AI simulators, the metric has to include physics-faithful continuation, not only visual plausibility.
AVP (2605.22183)
VLA / Visual Primitives / Action Expert
Action with Visual Primitives
The clean takeaway is to externalize VLM spatial understanding as visual primitive tokens before handing control to a flow action expert.
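
To make the common thread from the introduction concrete, here is a minimal toy sketch of a policy wrapped with persistent scene memory and a pre-execution check. It is not taken from any of the papers above: `GatedPolicy`, `reach_cup`, `within_step_limit`, and the 0.1 step limit are all hypothetical names and values chosen for illustration, and the "policy" is a stand-in for a real VLA action head.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Vec3 = Tuple[float, float, float]
HOLD: Vec3 = (0.0, 0.0, 0.0)  # fallback action: do not move


@dataclass
class GatedPolicy:
    """Hypothetical wrapper: scene memory plus a pre-execution check."""
    propose: Callable[[Dict[str, Vec3]], Vec3]  # stand-in for a VLA action head
    check: Callable[[Vec3], bool]               # stand-in for a runtime verifier
    memory: Dict[str, Vec3] = field(default_factory=dict)

    def observe(self, visible: Dict[str, Vec3]) -> None:
        # Objects missing from `visible` keep their last known position.
        self.memory.update(visible)

    def step(self) -> Vec3:
        action = self.propose(self.memory)
        # Execute only actions that pass the check; otherwise hold still.
        return action if self.check(action) else HOLD


def reach_cup(memory: Dict[str, Vec3]) -> Vec3:
    # Toy policy: command a displacement equal to the remembered cup offset.
    return memory.get("cup", HOLD)


def within_step_limit(action: Vec3, limit: float = 0.1) -> bool:
    # Toy verifier: reject any commanded displacement longer than `limit`.
    return sum(c * c for c in action) ** 0.5 <= limit


policy = GatedPolicy(propose=reach_cup, check=within_step_limit)
policy.observe({"cup": (0.05, 0.0, 0.0)})
policy.observe({})                         # cup leaves the field of view
near = policy.step()                       # proposed from memory, passes the check
policy.observe({"cup": (1.0, 0.0, 0.0)})
far = policy.step()                        # fails the check, replaced by HOLD
```

The point of the sketch is the separation of concerns: memory and verification sit around the policy, so either can be changed without retraining or enlarging the action head.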