Video is harder than still images (temporal coherence)
Independent per-frame generation produces flicker: object identity, lighting, and texture jump between frames even when each still looks plausible [2]. Video models must couple motion, lighting, and identity across time [2].
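Human vision integrates motion over roughly 100 ms, a window that spans more than one frame at typical playback rates. Frames generated independently break the continuity viewers expect within that window, so the result reads as unnatural even when image-level metrics such as FID, computed on individual frames, look good [2].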
Architectures differ: temporal attention lets frames attend to neighbors; 3D convolutions share weights across space and time; latent video diffusion operates on spatio-temporal VAE codes to reduce pixel redundancy [2].
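To make the first of these options concrete, here is a minimal NumPy sketch of single-head temporal attention, in which each spatial position attends across the frames of a clip. The function name, tensor layout `(T, N, C)`, and random projection matrices are illustrative assumptions, not the design of any particular model.

```python
import numpy as np

def temporal_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over the time axis (illustrative sketch).

    x: (T, N, C) array with T frames, N spatial positions, C channels.
    Each spatial position attends over its own T time steps.
    """
    h = np.transpose(x, (1, 0, 2))              # (N, T, C): positions act as batch
    q, k, v = h @ w_q, h @ w_k, h @ w_v         # linear projections
    scores = q @ np.transpose(k, (0, 2, 1))     # (N, T, T) frame-to-frame scores
    scores = scores / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over frames
    out = weights @ v                           # (N, T, C)
    return np.transpose(out, (1, 0, 2)), weights

rng = np.random.default_rng(0)
T, N, C = 8, 16, 4                              # hypothetical clip dimensions
x = rng.normal(size=(T, N, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) for _ in range(3))
out, weights = temporal_self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because the attention weights are computed per spatial position, this layer mixes information only along time; spatial mixing would come from separate layers in a real architecture.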

Compute and memory grow with the product of clip length and per-frame resolution, so longer or sharper clips become expensive quickly. Weak temporal modeling yields morphing identity, warped anatomy, or jerky motion that users notice immediately [2].
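A rough back-of-the-envelope calculation illustrates the scaling. The sketch below assumes a hypothetical patch-based tokenization (16 x 16 pixel patches) and full self-attention over all spatio-temporal tokens, whose pairwise cost is quadratic in token count; neither assumption comes from the source.

```python
def video_token_count(frames, height, width, patch=16):
    """Tokens produced if every frame is cut into patch x patch tiles (assumed scheme)."""
    return frames * (height // patch) * (width // patch)

def full_attention_pairs(tokens):
    """Number of query-key pairs in full self-attention over all tokens."""
    return tokens * tokens

short_clip = video_token_count(frames=16, height=256, width=256)
long_clip = video_token_count(frames=32, height=256, width=256)

print(short_clip, long_clip)  # 4096 8192
print(full_attention_pairs(long_clip) / full_attention_pairs(short_clip))  # 4.0
```

Under these assumptions, doubling the clip length doubles the token count but quadruples the attention pairs, which is one reason many designs restrict attention to temporal neighbors or work in a compressed latent space.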

Still-image quality is necessary but not sufficient for convincing video [2].
Video pipelines often predict latents for a whole clip, then decode them with a temporal VAE. This mirrors tokenization in LLMs: redundancy is compressed away before the expensive generative core runs [2].
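The size of the saving can be sketched with simple shape arithmetic. The compression factors below (4x in time, 8x in each spatial dimension, 4 latent channels) are assumed values chosen for illustration, not figures from the source or from a specific model.

```python
from math import prod

def latent_shape(frames, height, width, t_factor=4, s_factor=8, latent_channels=4):
    """Shape (T, C, H, W) of the latent clip under assumed compression factors."""
    return (frames // t_factor, latent_channels,
            height // s_factor, width // s_factor)

pixel_shape = (16, 3, 256, 256)        # 16 RGB frames at 256 x 256
latents = latent_shape(16, 256, 256)

print(latents)                          # (4, 4, 32, 32)
print(prod(pixel_shape) / prod(latents))  # 192.0
```

With these hypothetical factors, the generative core processes 192 times fewer values than it would in pixel space.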
Optical flow and other motion priors from classical vision still appear as auxiliary losses in some video models, where they help stabilize training of the temporal components [2].
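One common form such an auxiliary term can take is a warping loss: warp the previous frame by the flow field and penalize the difference from the next frame. The sketch below is a simplified assumption (integer flow, nearest-neighbor sampling, grayscale frames) and is not taken from any specific model; the function names are hypothetical.

```python
import numpy as np

def warp_with_flow(prev_frame, flow):
    """Backward-warp prev_frame toward the next frame.

    prev_frame: (H, W) grayscale image.
    flow: (H, W, 2) integer displacements (dy, dx), meaning
          next[y, x] is expected to match prev[y - dy, x - dx].
    """
    h, w = prev_frame.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(ys - flow[..., 0], 0, h - 1)  # clamp at image borders
    src_x = np.clip(xs - flow[..., 1], 0, w - 1)
    return prev_frame[src_y, src_x]

def flow_consistency_loss(prev_frame, next_frame, flow):
    """Mean absolute error between the next frame and the warped previous frame."""
    return float(np.mean(np.abs(next_frame - warp_with_flow(prev_frame, flow))))

rng = np.random.default_rng(0)
prev_frame = rng.random((32, 32))
next_frame = np.roll(prev_frame, shift=2, axis=1)  # content moves 2 px to the right

true_flow = np.zeros((32, 32, 2), dtype=int)
true_flow[..., 1] = 2                               # dx = 2 everywhere
zero_flow = np.zeros_like(true_flow)

loss_good = flow_consistency_loss(prev_frame, next_frame, true_flow)
loss_bad = flow_consistency_loss(prev_frame, next_frame, zero_flow)
print(loss_good < loss_bad)
```

The loss with the correct flow is small but not exactly zero here, because `np.roll` wraps pixels around the border while the warp clamps them.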
Benchmarks for video generation often include user studies on motion smoothness alongside per-frame sharpness; optimizing only still metrics misaligns with viewer experience [2].
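A toy example shows how a still-image statistic can miss flicker entirely. The sketch below compares two synthetic clips whose frames are equally "valid" as stills, using a crude temporal proxy (mean absolute difference between consecutive frames). The metric is an illustrative assumption, not a standard benchmark, and it would also penalize legitimate motion, so it is no substitute for user studies.

```python
import numpy as np

def temporal_flicker(clip):
    """Mean absolute difference between consecutive frames. clip: (T, H, W)."""
    return float(np.mean(np.abs(np.diff(clip, axis=0))))

def per_frame_contrast(clip):
    """Average per-frame standard deviation, a stand-in for a still-image statistic."""
    return float(np.mean(clip.std(axis=(1, 2))))

rng = np.random.default_rng(0)
T, H, W = 8, 32, 32
texture = rng.random((H, W))

coherent = np.repeat(texture[None], T, axis=0)  # same texture in every frame
independent = rng.random((T, H, W))             # texture resampled for each frame

flicker_coherent = temporal_flicker(coherent)
flicker_independent = temporal_flicker(independent)
print(per_frame_contrast(coherent), per_frame_contrast(independent))
print(flicker_coherent, flicker_independent)
```

Both clips have nearly the same per-frame contrast, yet only the independently sampled clip has a large frame-to-frame difference.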
Sources
Card Info
- Topic: Machine learning
- Difficulty: Intermediate