Video is harder than still images (temporal coherence)

Independent per-frame generation produces flicker: object identity, lighting, and texture jump between frames even when each still looks plausible [2]. Video models must couple motion, lighting, and identity across time [2].
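The flicker described above can be made concrete with a toy sketch. This is not a real generator: frames are flat lists of pixel intensities, `independent_frames` and `coupled_frames` are hypothetical stand-ins for a per-frame sampler and a temporally coupled one, and `temporal_flicker` is an assumed metric (mean absolute change between consecutive frames).

```python
import random

def temporal_flicker(frames):
    """Mean absolute change between consecutive frames (frames are flat lists of floats)."""
    total, count = 0.0, 0
    for prev, curr in zip(frames, frames[1:]):
        for a, b in zip(prev, curr):
            total += abs(b - a)
            count += 1
    return total / count

def independent_frames(num_frames, num_pixels, rng):
    # Hypothetical per-frame sampler: every frame is drawn on its own,
    # so nothing ties frame t to frame t + 1.
    return [[rng.random() for _ in range(num_pixels)] for _ in range(num_frames)]

def coupled_frames(num_frames, num_pixels, rng, step=0.02):
    # Hypothetical coupled sampler: each frame is the previous frame plus a
    # small perturbation, clipped to the valid intensity range [0, 1].
    frames = [[rng.random() for _ in range(num_pixels)]]
    for _ in range(num_frames - 1):
        frames.append([min(1.0, max(0.0, p + rng.uniform(-step, step)))
                       for p in frames[-1]])
    return frames

rng = random.Random(0)
flicker_independent = temporal_flicker(independent_frames(16, 256, rng))
flicker_coupled = temporal_flicker(coupled_frames(16, 256, rng))
print(flicker_independent, flicker_coupled)
```

In this toy setup each still is just a field of random intensities in both cases, yet the independent sequence changes far more from frame to frame than the coupled one. That gap is what a viewer perceives as flicker.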

Human vision integrates motion over roughly 100 ms, a window that spans two or more frames at typical frame rates. Generating frames independently violates that prior, so artifacts read as unnatural even when FID computed on individual frames looks good [2].

Architectures differ: temporal attention lets frames attend to neighbors; 3D convolutions share weights across space and time; latent video diffusion operates on spatio-temporal VAE codes to reduce pixel redundancy [2].
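As a rough illustration of the temporal-attention idea, the sketch below applies single-head self-attention along the time axis only, separately for each spatial position. It assumes identity query/key/value projections and a nested-list layout `video[t][p]` (frame `t`, position `p`). Both are simplifications chosen for brevity, not how any particular model is built.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def temporal_self_attention(tokens):
    """tokens: T feature vectors for ONE spatial position across T frames.

    Assumption: identity Q/K/V projections (a real layer would learn these).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        # Score this frame's query against the same position in every frame.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out

def apply_per_location(video):
    """video[t][p] is the feature vector of position p in frame t."""
    num_frames, num_positions = len(video), len(video[0])
    result = [[None] * num_positions for _ in range(num_frames)]
    for p in range(num_positions):
        mixed = temporal_self_attention([video[t][p] for t in range(num_frames)])
        for t in range(num_frames):
            result[t][p] = mixed[t]
    return result
```

Each output vector is a weighted average of the same position across all frames, which is one way information about identity and lighting can be shared through time. Spatial positions do not interact in this layer, so a real model would pair it with spatial layers.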

Compute and memory scale with the number of frames times the pixels per frame, so longer or higher-resolution clips quickly become expensive. Weak temporal modeling yields morphing identity, warped anatomy, or jerky motion that users notice immediately [2].
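A back-of-the-envelope sketch of that scaling, under stated assumptions: the clip is cut into non-overlapping square patches (the `patch=16` default is hypothetical), and the model uses full self-attention over all spatio-temporal tokens, so the attention matrix has one entry per token pair.

```python
def token_count(frames, height, width, patch=16):
    # Tokens grow linearly in clip length and in pixels per frame.
    return frames * (height // patch) * (width // patch)

def full_attention_entries(frames, height, width, patch=16):
    # Full self-attention compares every token with every other token.
    n = token_count(frames, height, width, patch)
    return n * n

print(token_count(16, 256, 256))             # 4096
print(full_attention_entries(16, 256, 256))  # 16777216
print(token_count(32, 256, 256))             # 8192
print(full_attention_entries(32, 256, 256))  # 67108864
```

Under these assumptions, doubling the clip length doubles the token count but quadruples the attention matrix.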

Still-image quality is necessary but not sufficient for convincing video [2].

Video pipelines often predict latents per clip, then decode with a temporal VAE. That mirrors LLM tokenization: compress redundancy before the expensive generative core operates [2].
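To give a feel for how much a spatio-temporal latent can shrink the problem, here is a small arithmetic sketch. The downsampling factors (`t_down=4`, `s_down=8`) and `latent_channels=4` are hypothetical defaults chosen for illustration and are not taken from the source.

```python
def clip_values(frames, height, width, channels):
    # Number of scalar values in a clip tensor of the given shape.
    return frames * height * width * channels

def latent_shape(frames, height, width, t_down=4, s_down=8, latent_channels=4):
    # Assumed VAE: temporal downsampling by t_down, spatial by s_down.
    return (frames // t_down, height // s_down, width // s_down, latent_channels)

pixel_values = clip_values(16, 256, 256, 3)   # RGB clip
latent_values = clip_values(*latent_shape(16, 256, 256))
print(pixel_values, latent_values, pixel_values / latent_values)
# 3145728 16384 192.0
```

With these assumed factors the generative core sees 192 times fewer values than the raw pixel clip.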

Optical flow and motion priors from classical vision still appear as auxiliary losses in some video models to stabilize temporal gradients [2].
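One common form such an auxiliary term can take is a warping consistency loss: follow the flow from each pixel in one frame and penalize the difference to the pixel it lands on in the next. The sketch below is a deliberately simplified version with integer displacements and nearest-pixel lookup on grayscale frames. A trainable loss would need differentiable (for example bilinear) warping and an estimated flow field, and `warp_consistency_loss` is a hypothetical name.

```python
def warp_consistency_loss(prev_frame, next_frame, flow):
    """Mean absolute difference between each pixel and its flow-displaced match.

    prev_frame, next_frame: 2D lists of intensities.
    flow[y][x] = (dx, dy): assumed integer displacement of pixel (x, y)
    from prev_frame to next_frame.
    """
    height, width = len(prev_frame), len(prev_frame[0])
    total, count = 0.0, 0
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            nx, ny = x + dx, y + dy
            # Skip pixels whose match falls outside the frame.
            if 0 <= nx < width and 0 <= ny < height:
                total += abs(next_frame[ny][nx] - prev_frame[y][x])
                count += 1
    return total / count if count else 0.0

# A bright pixel moves one step to the right between frames.
prev_frame = [[0.0] * 4 for _ in range(4)]
next_frame = [[0.0] * 4 for _ in range(4)]
prev_frame[1][1] = 1.0
next_frame[1][2] = 1.0

right_flow = [[(1, 0)] * 4 for _ in range(4)]
zero_flow = [[(0, 0)] * 4 for _ in range(4)]
print(warp_consistency_loss(prev_frame, next_frame, right_flow))  # 0.0
print(warp_consistency_loss(prev_frame, next_frame, zero_flow))   # 0.125
```

When the flow matches the true motion the loss vanishes, while ignoring the motion leaves a residual. Minimizing a term of this kind nudges generated frames toward content that moves consistently instead of changing arbitrarily.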

Benchmarks for video generation often include user studies on motion smoothness alongside per-frame sharpness; optimizing only still metrics misaligns with viewer experience [2].

Sources

[2] https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
