Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified-flow objectives. Despite these advances, they still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes.
3D-consistent video generation would benefit numerous downstream generation and reconstruction tasks. This work explores how simple epipolar geometry constraints can improve modern video diffusion models trained on internet-scale data. Despite their massive training sets, these models often fail to capture the fundamental geometric principles underlying all visual content.
While traditional computer vision methods are often non-differentiable and computationally expensive, they provide reliable, mathematically grounded signals for 3D consistency evaluation. We demonstrate that aligning diffusion models through a preference-based optimization framework using pairwise epipolar geometry constraints yields videos with superior visual quality, enhanced 3D consistency, and significantly improved motion stability.
Our approach bridges modern video diffusion models with classical computer vision algorithms using epipolar geometry constraints as reward signals in a preference-based finetuning framework.
Generate diverse videos with the pretrained generator and score each one's epipolar consistency via the Sampson distance, separating well-constrained from poorly constrained examples.
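As a concrete illustration, the sketch below scores a frame pair with the Sampson distance, using off-the-shelf SIFT matching and RANSAC fundamental-matrix estimation. The function names and thresholds are illustrative, not our exact implementation.

```python
# Minimal sketch: epipolar consistency between two frames via the Sampson distance.
import cv2
import numpy as np

def sampson_distances(pts1, pts2, F):
    """First-order geometric (Sampson) distance for matched points under F."""
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])          # N x 3 homogeneous points, frame i
    x2 = np.hstack([pts2, ones])          # N x 3 homogeneous points, frame j
    Fx1 = x1 @ F.T                        # epipolar lines in frame j
    Ftx2 = x2 @ F                         # epipolar lines in frame i
    num = np.sum(x2 * Fx1, axis=1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / np.maximum(den, 1e-12)

def epipolar_score(frame_a, frame_b, max_feats=2000):
    """Mean Sampson distance between two frames; lower means more 3D-consistent."""
    sift = cv2.SIFT_create(max_feats)
    k1, d1 = sift.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = sift.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None or inliers is None:
        return None                        # degenerate pair: too few matches or no parallax
    mask = inliers.ravel().astype(bool)
    return float(sampson_distances(pts1[mask], pts2[mask], F).mean())
```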
Train the policy with Flow-DPO to prefer geometrically consistent outputs, learning from example pairs ranked by epipolar error without requiring a differentiable reward.
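The sketch below shows one way such a preference loss can be written for a rectified-flow (velocity-prediction) model, following the DPO formulation for diffusion/flow models. The `policy`/`ref` call signature and the `beta` value are assumptions, not our exact training code.

```python
# Minimal sketch of a DPO-style preference loss for a rectified-flow video model.
# `policy` is trainable, `ref` is a frozen copy of the pretrained model.
import torch
import torch.nn.functional as F

def flow_dpo_loss(policy, ref, x_w, x_l, cond, beta=500.0):
    """x_w / x_l: clean latents of the preferred / rejected video, shaped (B, C, T, H, W)."""
    noise = torch.randn_like(x_w)
    t = torch.rand(x_w.shape[0], device=x_w.device).view(-1, 1, 1, 1, 1)

    def fm_err(model, x0):
        # Rectified-flow interpolation x_t = (1 - t) x0 + t * noise,
        # target velocity v* = noise - x0; per-sample MSE to the prediction.
        x_t = (1.0 - t) * x0 + t * noise
        v_pred = model(x_t, t.flatten(), cond)   # assumed model interface
        v_star = noise - x0
        return ((v_pred - v_star) ** 2).mean(dim=(1, 2, 3, 4))

    err_w, err_l = fm_err(policy, x_w), fm_err(policy, x_l)
    with torch.no_grad():
        ref_w, ref_l = fm_err(ref, x_w), fm_err(ref, x_l)

    # Prefer the winner: its flow-matching error should drop (relative to the
    # reference model) more than the loser's.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()
```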
Apply the updated policy to enhance 3D consistency in the base video diffusion model, leading to more stable camera trajectories and reduced artifacts.
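For completeness, a hypothetical loop tying these steps together: sample several videos per prompt, rank them by average Sampson score, and emit (condition, preferred, rejected) triples for the Flow-DPO loss above. The `generator.sample` interface is a placeholder, and this reuses `epipolar_score` from the earlier sketch.

```python
# Hypothetical preference-pair construction from epipolar scores.
import numpy as np

def build_preference_pairs(generator, prompts, n_samples=4):
    pairs = []
    for prompt in prompts:
        videos = [generator.sample(prompt) for _ in range(n_samples)]
        scores = []
        for v in videos:
            d = [epipolar_score(v[i], v[i + 1]) for i in range(len(v) - 1)]
            d = [s for s in d if s is not None]              # skip degenerate frame pairs
            scores.append(np.mean(d) if d else np.inf)
        best, worst = int(np.argmin(scores)), int(np.argmax(scores))
        pairs.append((prompt, videos[best], videos[worst]))  # (cond, preferred, rejected)
    return pairs
```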
Our epipolar-aligned model significantly reduces artifacts and improves motion smoothness, yielding more geometrically consistent 3D scenes. To demonstrate this, we present examples where the baseline model's outputs lack 3D consistency; below, baseline results and our approach can be compared side by side.
We evaluate our approach with direct 3D geometry metrics, human preference, and VBench metrics. For 3D reconstruction quality, we fit Gaussian splatting to the generated videos and report PSNR, SSIM, and LPIPS.
| Method | Sampson Error ↓ | Perspective ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Human Eval ↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.190 | 0.426 | 22.32 | 0.706 | 0.343 | 54.1% |
| Ours | 0.131 | 0.428 | 23.13 | 0.729 | 0.315 | 71.8% |
| Method | Background Consistency ↑ | Aesthetic Quality ↑ | Temporal Flickering ↑ | Motion Smoothness ↑ |
|---|---|---|---|---|
| Baseline | 0.930 | 0.541 | 0.958 | 0.981 |
| Ours | 0.942 | 0.551 | 0.969 | 0.984 |
We presented a novel approach for enhancing 3D consistency in video diffusion models by leveraging classical epipolar geometry constraints as preference signals. Our work demonstrates that aligning modern generative models with fundamental geometric principles can significantly improve the spatial coherence of generated content without requiring complex 3D supervision.
The resulting models generate videos with notably fewer geometric inconsistencies and more stable camera trajectories while preserving creative flexibility. This work highlights how classical computer vision algorithms can effectively complement deep learning approaches, addressing limitations in purely data-driven systems.