Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified-flow objectives. Despite these advances, they still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes.
3D-consistent video generation would benefit numerous downstream generation and reconstruction tasks. This work explores how simple epipolar geometry constraints can improve modern video diffusion models trained on internet-scale data. Despite their massive training sets, these models often fail to capture the fundamental geometric principles underlying all visual content.
While traditional computer vision methods are often non-differentiable and computationally expensive, they provide reliable, mathematically grounded signals for 3D consistency evaluation. We demonstrate that aligning diffusion models through a preference-based optimization framework using pairwise epipolar geometry constraints yields videos with superior visual quality, enhanced 3D consistency, and significantly improved motion stability.
Our approach bridges modern video diffusion models with classical computer vision algorithms using epipolar geometry constraints as reward signals in a preference-based finetuning framework.
Generate diverse videos with the pretrained generator and score each one's epipolar consistency via the Sampson distance, separating well-constrained from poorly constrained examples.
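As a concrete illustration, the sketch below scores a frame pair with the Sampson distance, using off-the-shelf SIFT matching and RANSAC fundamental-matrix estimation. The function names and thresholds are illustrative, not our exact implementation.

```python
# Minimal sketch: epipolar consistency between two frames via the Sampson distance.
import cv2
import numpy as np

def sampson_distances(pts1, pts2, F):
    """First-order geometric (Sampson) distance for matched points under F."""
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])          # N x 3 homogeneous points, frame i
    x2 = np.hstack([pts2, ones])          # N x 3 homogeneous points, frame j
    Fx1 = x1 @ F.T                        # epipolar lines in frame j
    Ftx2 = x2 @ F                         # epipolar lines in frame i
    num = np.sum(x2 * Fx1, axis=1) ** 2   # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / np.maximum(den, 1e-12)

def epipolar_score(frame_a, frame_b, max_feats=2000):
    """Mean Sampson distance between two frames; lower means more 3D-consistent."""
    sift = cv2.SIFT_create(max_feats)
    k1, d1 = sift.detectAndCompute(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = sift.detectAndCompute(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None or inliers is None:
        return None                        # degenerate pair: too few matches or no parallax
    mask = inliers.ravel().astype(bool)
    return float(sampson_distances(pts1[mask], pts2[mask], F).mean())
```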
Train the policy with Flow-DPO to prefer geometrically consistent outputs, learning from example pairs ranked by epipolar error without requiring a differentiable reward.
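The sketch below shows one way such a preference loss can be written for a rectified-flow (velocity-prediction) model, following the DPO formulation for diffusion/flow models. The `policy`/`ref` call signature and the `beta` value are assumptions, not our exact training code.

```python
# Minimal sketch of a DPO-style preference loss for a rectified-flow video model.
# `policy` is trainable, `ref` is a frozen copy of the pretrained model.
import torch
import torch.nn.functional as F

def flow_dpo_loss(policy, ref, x_w, x_l, cond, beta=500.0):
    """x_w / x_l: clean latents of the preferred / rejected video, shaped (B, C, T, H, W)."""
    noise = torch.randn_like(x_w)
    t = torch.rand(x_w.shape[0], device=x_w.device).view(-1, 1, 1, 1, 1)

    def fm_err(model, x0):
        # Rectified-flow interpolation x_t = (1 - t) x0 + t * noise,
        # target velocity v* = noise - x0; per-sample MSE to the prediction.
        x_t = (1.0 - t) * x0 + t * noise
        v_pred = model(x_t, t.flatten(), cond)   # assumed model interface
        v_star = noise - x0
        return ((v_pred - v_star) ** 2).mean(dim=(1, 2, 3, 4))

    err_w, err_l = fm_err(policy, x_w), fm_err(policy, x_l)
    with torch.no_grad():
        ref_w, ref_l = fm_err(ref, x_w), fm_err(ref, x_l)

    # Prefer the winner: its flow-matching error should drop (relative to the
    # reference model) more than the loser's.
    logits = -beta * ((err_w - ref_w) - (err_l - ref_l))
    return -F.logsigmoid(logits).mean()
```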
Apply the updated policy to enhance 3D consistency in the base video diffusion model, leading to more stable camera trajectories and reduced artifacts.
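For completeness, a hypothetical loop tying these steps together: sample several videos per prompt, rank them by average Sampson score, and emit (condition, preferred, rejected) triples for the Flow-DPO loss above. The `generator.sample` interface is a placeholder, and this reuses `epipolar_score` from the earlier sketch.

```python
# Hypothetical preference-pair construction from epipolar scores.
import numpy as np

def build_preference_pairs(generator, prompts, n_samples=4):
    pairs = []
    for prompt in prompts:
        videos = [generator.sample(prompt) for _ in range(n_samples)]
        scores = []
        for v in videos:
            d = [epipolar_score(v[i], v[i + 1]) for i in range(len(v) - 1)]
            d = [s for s in d if s is not None]              # skip degenerate frame pairs
            scores.append(np.mean(d) if d else np.inf)
        best, worst = int(np.argmin(scores)), int(np.argmax(scores))
        pairs.append((prompt, videos[best], videos[worst]))  # (cond, preferred, rejected)
    return pairs
```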
Our epipolar-aligned model significantly reduces artifacts and improves motion smoothness, yielding more geometrically consistent 3D scenes. To demonstrate this, we present examples where the baseline model's outputs lack 3D consistency; below, baseline results and our approach can be compared side by side.
We evaluate our approach with direct 3D geometry metrics, human preference, and VBench metrics. For 3D reconstruction quality, we fit Gaussian splatting to the generated videos and report PSNR, SSIM, and LPIPS.
| Method | Sampson Error ↓ | Perspective ↑ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Human Eval ↑ |
|---|---|---|---|---|---|---|
| Baseline | 0.190 | 0.426 | 22.32 | 0.706 | 0.343 | 54.1% |
| Ours | 0.131 | 0.428 | 23.13 | 0.729 | 0.315 | 71.8% |
| Method | Background Consistency ↑ | Aesthetic Quality ↑ | Temporal Flickering ↑ | Motion Smoothness ↑ |
|---|---|---|---|---|
| Baseline | 0.930 | 0.541 | 0.958 | 0.981 |
| Ours | 0.942 | 0.551 | 0.969 | 0.984 |
We presented a novel approach for enhancing 3D consistency in video diffusion models by leveraging classical epipolar geometry constraints as preference signals. Our work demonstrates that aligning modern generative models with fundamental geometric principles can significantly improve the spatial coherence of generated content without requiring complex 3D supervision.
The resulting models generate videos with notably fewer geometric inconsistencies and more stable camera trajectories while preserving creative flexibility. This work highlights how classical computer vision algorithms can effectively complement deep learning approaches, addressing limitations in purely data-driven systems.