RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

Each clip is generated from a single input frame and a target camera trajectory; the inset shows the camera motion.

Abstract

Modern video diffusion transformers place tokens on a 2D pixel grid and encode their positions with schemes such as RoPE over the (u, v, t) axes. These encodings describe the camera's sampling grid rather than the 3D structure of the scene. RayPE adds per-token Plücker ray coordinates to the queries and keys of self-attention. A query/key flip makes the Euclidean inner product reduce to the Plücker reciprocal product even before anything is learned, which gives the model a built-in 3D inductive bias. A Normalize-Gate-Inject design decouples ray direction from ray-moment magnitude. The module adds <0.1% parameters, is zero-initialized, and coexists with the original RoPE rather than replacing it.
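The identity behind the query/key flip can be checked numerically. The sketch below is illustrative only and is not the RayPE implementation: it assumes the common Plücker convention (d, m) with unit direction d and moment m = o × d, and it assumes that the "flip" swaps the direction and moment halves on the key side. All function names (`plucker`, `flip`, `reciprocal_product`, `flipped_inner_product`) are hypothetical, and the learned projections, gating, and RoPE are left out.

```python
import math
from typing import Tuple

Vec = Tuple[float, ...]


def dot(a: Vec, b: Vec) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross(a: Vec, b: Vec) -> Vec:
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker(origin: Vec, direction: Vec) -> Vec:
    """Return the 6-vector (d, m): unit direction d and moment m = o x d."""
    norm = math.sqrt(dot(direction, direction))
    d = tuple(x / norm for x in direction)
    return d + cross(origin, d)


def flip(ray: Vec) -> Vec:
    """Swap the direction and moment halves: (d, m) -> (m, d). Assumed form of the flip."""
    return ray[3:] + ray[:3]


def reciprocal_product(a: Vec, b: Vec) -> float:
    """Pluecker reciprocal product d_a . m_b + m_a . d_b."""
    return dot(a[:3], b[3:]) + dot(a[3:], b[:3])


def flipped_inner_product(q_ray: Vec, k_ray: Vec) -> float:
    """Plain 6-D dot product between a query ray and a flipped key ray."""
    return dot(q_ray, flip(k_ray))


# Two rays from different camera centers that both pass through the point (1, 2, 5).
ray_a = plucker((0.0, 0.0, 0.0), (1.0, 2.0, 5.0))
ray_b = plucker((2.0, 0.0, 0.0), (-1.0, 2.0, 5.0))
# A third ray that is skew to ray_a (the two lines never meet).
ray_c = plucker((2.0, 0.0, 0.0), (0.0, 1.0, 0.0))

print(flipped_inner_product(ray_a, ray_b), reciprocal_product(ray_a, ray_b))
print(flipped_inner_product(ray_a, ray_c), reciprocal_product(ray_a, ray_c))
```

In this sketch the flipped dot product matches the reciprocal product for every pair, and the reciprocal product vanishes (up to rounding) for the two rays that meet at a common 3D point while staying nonzero for the skew pair. That is one way to read the "built-in 3D inductive bias": an unmodified attention dot product already measures whether two pixel rays can observe the same scene point.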

RayPE enables precise relative camera control for pretrained video diffusion models. Given a target camera trajectory, it generates videos that faithfully follow the path while preserving the base model's generation quality. Top: image-to-video (left) and text-to-video (right). Bottom: out-of-distribution generalization to movie stills.

Key Ideas


Same scene · many cameras

Scenes

One generated scene browsed under six different camera trajectories. The inset visualizes the commanded camera path.

WASD · third-person control

Motions

WASD inputs drive the camera trajectory that moves the third-person subject. Overlay keys reflect the per-frame motion.

More results

Real-World Scenes

Diverse real captures driven along varied camera paths from a single frame.

In-the-Wild Videos

Everyday footage re-rendered under controllable camera motion.

Stylized & Artistic Scenes

Illustrated and painterly inputs animated with 3D-consistent camera control.