STARFlow-V is the first normalizing flow-based causal video generator demonstrating that normalizing flows can match video diffusion models in visual quality while offering end-to-end training, exact likelihood estimation, and native multi-task support across T2V/I2V/V2V generation.
Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems rely almost exclusively on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in a spatiotemporal latent space with a global-local architecture that restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion-based generation. Additionally, we propose flow-score matching, which equips the model with a lightweight causal denoiser to improve video generation consistency in an autoregressive fashion. To improve sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to its invertible structure, the same model natively supports text-to-video, image-to-video, and video-to-video generation. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models.

Figure: STARFlow-V pipeline. The model processes text prompts and noise through a Deep Autoregressive Block (global temporal reasoning) to produce intermediate latents, which are then refined by Shallow Flow Blocks (local within-frame details). A Learnable Causal Denoiser (trained via Flow-Score Matching) cleans the output. The model is trained end-to-end with two objectives: Maximum Likelihood for the flow and Flow-Score Matching for the denoiser.
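To make the pipeline concrete, here is a minimal PyTorch-style sketch of a single autoregressive generation step. The module interfaces (the invert calls, the denoiser signature, and the way context is passed) are assumptions for illustration only, not the released implementation.

```python
import torch
import torch.nn as nn

class STARFlowVSketch(nn.Module):
    """Minimal sketch of the STARFlow-V generation pipeline (interfaces assumed)."""

    def __init__(self, deep_block, shallow_blocks, denoiser):
        super().__init__()
        self.deep_block = deep_block                        # deep causal Transformer flow over global latents
        self.shallow_blocks = nn.ModuleList(shallow_blocks)  # shallow per-frame flow blocks (no cross-frame deps)
        self.denoiser = denoiser                            # lightweight causal denoiser (flow-score matching)

    @torch.no_grad()
    def generate_frame(self, noise_t, text_emb, past_latents):
        # 1) Global temporal reasoning: invert the deep autoregressive flow for
        #    frame t, conditioned on the prompt and previously generated latents.
        z = self.deep_block.invert(noise_t, cond=text_emb, context=past_latents)

        # 2) Local within-frame detail: invert the shallow flow blocks, which act
        #    on each frame independently.
        for block in reversed(self.shallow_blocks):
            z = block.invert(z)

        # 3) Single-step refinement with the learned causal denoiser.
        return self.denoiser(z, context=past_latents)
```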
1. A novel two-level architecture that separates global temporal reasoning from local within-frame details. A deep causal Transformer block processes the video autoregressively in compressed latent space to capture long-range spatiotemporal dependencies, while shallow flow blocks operate independently on each frame to model rich local structures. This design mitigates compounding errors common in pixel-space autoregressive models.
2. A unified training framework that combines normalizing flow maximum likelihood with flow-score matching for denoising. Instead of using imperfect or non-causal denoisers, we train a lightweight causal neural denoiser alongside the main flow model. This denoiser learns to predict the score (gradient of log-probability) of the model's own distribution, enabling high-quality single-step refinement while preserving causality (see the first sketch after this list).
3. Generation (flow inversion) is recast as solving a nonlinear system, enabling block-wise parallel updates of multiple latents simultaneously instead of one-by-one generation. Combined with video-aware initialization that uses temporal information from adjacent frames and pipelined execution between deep and shallow blocks, this achieves significant speedup while maintaining generation quality (see the second sketch after this list).
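The first sketch below illustrates one plausible way to set up the flow-score-matching objective for the causal denoiser, assuming a standard denoising-score-matching formulation around samples drawn from the flow itself. The noise level, weighting, and conditioning actually used by STARFlow-V are not specified on this page, so treat the names and details as illustrative.

```python
import torch
import torch.nn.functional as F

def flow_score_matching_loss(flow_model, denoiser, noise, text_emb, sigma=0.1):
    """Illustrative flow-score-matching loss (assumed DSM-style formulation)."""
    # Draw a sample from the flow model; the flow itself is trained with
    # maximum likelihood, so we do not backpropagate through sampling here.
    with torch.no_grad():
        z = flow_model.sample(noise, cond=text_emb)

    # Perturb the sample with small Gaussian noise.
    eps = torch.randn_like(z)
    z_noisy = z + sigma * eps

    # For a Gaussian perturbation, the conditional score is
    # -(z_noisy - z) / sigma^2 = -eps / sigma.
    target_score = -eps / sigma

    # The causal denoiser is trained to predict this score, which is what
    # enables single-step refinement of the flow's own samples.
    pred_score = denoiser(z_noisy, cond=text_emb)
    return F.mse_loss(pred_score, target_score)
```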
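The second sketch shows the idea behind the Jacobi-style parallel inversion: sequential sampling of an autoregressive flow is a triangular nonlinear system, so every position can be updated in parallel from the previous iterate and the process converges to the same fixed point as one-by-one generation. The inverse_step interface and the video-aware initialization passed in as init_latents are assumptions for illustration.

```python
import torch

@torch.no_grad()
def jacobi_invert(flow_model, noise, init_latents, num_iters=8):
    """Illustrative Jacobi iteration for autoregressive flow inversion.

    noise:        [T, ...] per-position noise to be inverted.
    init_latents: [T, ...] video-aware initialization, e.g. reusing latents
                  from adjacent frames rather than starting from pure noise.
    """
    z = init_latents.clone()
    for _ in range(num_iters):
        # Parallel update: every position t is recomputed from the *previous*
        # iterate of its prefix z[:t], instead of waiting for step t-1.
        z_new = flow_model.inverse_step(noise, context=z)  # hypothetical API
        if torch.allclose(z_new, z, atol=1e-4):            # fixed point reached
            break
        z = z_new
    return z
```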
STARFlow-V is trained on 70M text-video pairs and 400M text-image pairs, with a final 7B parameter model that can generate 480p video at 16fps. The model operates in a compressed latent space and leverages the invertible nature of normalizing flows to natively support multiple generation tasks without any architectural changes or retraining.
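As a rough illustration of how invertibility unifies these tasks, the sketch below encodes any conditioning frames back into latents with the forward flow and simply treats them as the autoregressive prefix. The encode, decode, generate_frame, and latent_shape interfaces are assumptions for illustration, not the released API.

```python
import torch

@torch.no_grad()
def generate(flow_model, text_emb, cond_frames=None, num_frames=80):
    """One model, three tasks: T2V (no prefix), I2V (1-frame prefix), V2V (multi-frame prefix)."""
    # Encode the conditioning frames (if any) into latents via the forward flow;
    # invertibility means these latents reproduce the inputs exactly on decoding.
    past = [] if cond_frames is None else list(flow_model.encode(cond_frames, cond=text_emb))

    # Continue autoregressively from the encoded prefix.
    while len(past) < num_frames:
        noise_t = torch.randn(flow_model.latent_shape)             # assumed attribute
        z_t = flow_model.generate_frame(noise_t, text_emb, past)   # assumed method
        past.append(z_t)
    return flow_model.decode(torch.stack(past))                    # latent-to-pixel decoder (assumed)
```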
The sections below showcase the model's capabilities across different generation tasks. Each category demonstrates a specific aspect of STARFlow-V, from standard text-to-video generation to long-form video creation and comparisons with diffusion-based baselines.
If you find STARFlow-V useful in your research, please consider citing our work:
@article{gu2025starflowv,
title={STARFlow-V: End-to-End Video Generative Modeling with Scalable Normalizing Flows},
author={Gu, Jiatao and Shen, Ying and Chen, Tianrong and Dinh, Laurent and Wang, Yuyang and Bautista, Miguel \'Angel and Berthelot, David and Susskind, Josh and Zhai, Shuangfei},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2025}
}
Our model generates high-quality videos directly from text descriptions.
All text-to-video samples: 480p • 16fps • 5s.
STARFlow-V generates videos from input images while maintaining temporal consistency. Because the model is autoregressive, no architectural changes are needed: a single model handles all tasks seamlessly.

All image-to-video samples: conditioning image plus generated video at 480p • 16fps • 5s.
STARFlow-V can also extend and transform existing videos while maintaining temporal consistency. Again, because the model is autoregressive, no architectural changes are needed: a single model handles all tasks seamlessly.
All video-to-video samples: 384p • 16fps • 2s.
Extended video generation (10s, 15s, 30s) using autoregressive segment-by-segment generation. The tail of each 5s segment is re-encoded as the prefix for the next segment, leveraging the invertibility of normalizing flows.
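A minimal sketch of this segment-by-segment extension, assuming a hypothetical generate_segment helper that produces one roughly 5s clip conditioned on an optional frame prefix (in the spirit of the conditioning sketch earlier on this page):

```python
import torch

@torch.no_grad()
def generate_long_video(flow_model, text_emb, num_segments=3, tail_frames=16):
    """Chain ~5s segments into a longer video by re-encoding each segment's tail (illustrative)."""
    segments, prefix = [], None
    for _ in range(num_segments):
        # Generate one segment, conditioned on the re-encoded tail of the previous one (if any).
        video = flow_model.generate_segment(text_emb, cond_frames=prefix)  # hypothetical helper

        # Drop the frames that merely reproduce the prefix before stitching.
        segments.append(video if prefix is None else video[tail_frames:])

        # The tail becomes the prefix of the next segment; invertibility lets the
        # flow recover its latents exactly when re-encoding.
        prefix = video[-tail_frames:]
    return torch.cat(segments, dim=0)
```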
Long-video samples: 480p • 16fps, with durations of 10s, 13s, 15s, and 30s.
Side-by-side comparisons with baseline autoregressive diffusion models. All prompts are sampled from VBench (Huang et al., 2023). Each video shows three methods from left to right: NOVA (https://github.com/baaivision/NOVA), WAN-Causal (finetuned from WAN, checkpoint provided at https://huggingface.co/gdhe17/Self-Forcing/blob/main/checkpoints/ode_init.pt), and STARFlow-V (ours).
Each comparison (NOVA | WAN-Causal | STARFlow-V): 480p • 16fps, 3s to 7s per clip.
Examples where our model struggles or produces suboptimal results, particularly on complex motion and physical interactions. These limitations stem from: (1) insufficient training due to resource constraints, (2) low-quality training data, and (3) the absence of post-training refinement—we perform only pretraining without supervised fine-tuning (SFT) or reinforcement learning (RL).
All failure-case samples: 480p • 16fps • 5s.