Flow Q-Learning

Overview

  • Flow Q-learning (FQL) is a simple and effective method that trains an expressive flow-matching policy for data-driven reinforcement learning (RL), including offline RL and offline-to-online RL.
  • FQL is simple! Thanks to the simplicity of flow matching, FQL can be implemented in a few lines of code on top of an existing actor-critic framework, without requiring any tuning of noise or sampling strategies.
  • FQL is scalable and performant. FQL's expressive flow policy can handle data with arbitrarily complex, multimodal action distributions, achieving strong performance on a variety of challenging robotic locomotion and manipulation tasks.

Challenge

  • Training a flow policy with behavioral cloning (BC) is straightforward. We can simply train a velocity field \(v_\theta(t, s, x): [0, 1] \times \mathcal{S} \times \mathbb{R}^d \to \mathbb{R}^d\) with the standard flow-matching loss (see this excellent tutorial if you're not familiar with flow matching!): $$\begin{aligned} \mathbb{E}_{\substack{s, a=x^1 \sim \mathcal{D}, \\ x^0 \sim \mathcal{N}(0, I_d), \\ t \sim \mathrm{Unif}([0, 1])}} \left[\|v_\theta(t, s, x^t) - (x^1 - x^0)\|_2^2\right], \end{aligned}$$ where \(x^t = (1 - t)x^0 + tx^1\) is a linear interpolation between the two samples. Note that flow matching happens in the action space \(\mathcal{A} = \mathbb{R}^d\). (A code sketch of this loss follows this list.)
  • The resulting velocity field \(v_\theta\) will generate a flow that transforms the unit Gaussian into the behavioral action distribution, \(\pi^\beta(a \mid s)\).
  • But, how can we train a flow policy with reinforcement learning to maximize rewards? Our goal is, given a dataset \(\mathcal{D}\) of transitions \((s, a, r, s')\), to train a flow policy that maximizes the expected return while not deviating too much from the dataset.
  • This is a highly non-trivial problem because flow policies are iterative. Naïvely maximizing values with a flow policy will require recursive backpropagation through the flow, which often leads to unstable training and suboptimal performance.
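
For concreteness, here is a minimal JAX sketch of the flow-matching BC loss above. The function and parameter names (e.g., `velocity_field`, `params`) are illustrative assumptions, not the official FQL implementation.

```python
# Minimal sketch of the BC flow-matching loss, assuming a network
# velocity_field(params, t, s, x) that maps (time, state, noisy action)
# to a velocity in R^d.
import jax
import jax.numpy as jnp

def bc_flow_matching_loss(params, velocity_field, key, s, a):
    """s: states of shape (B, state_dim); a: dataset actions x^1 of shape (B, d)."""
    key_x0, key_t = jax.random.split(key)
    x0 = jax.random.normal(key_x0, a.shape)          # x^0 ~ N(0, I_d)
    t = jax.random.uniform(key_t, (a.shape[0], 1))   # t ~ Unif([0, 1])
    xt = (1.0 - t) * x0 + t * a                      # x^t = (1 - t) x^0 + t x^1
    v = velocity_field(params, t, s, xt)             # predicted velocity
    return jnp.mean(jnp.sum((v - (a - x0)) ** 2, axis=-1))
```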

Idea

  • We propose a new, simple, and effective mechanism to train a flow policy with RL, which we call flow Q-learning (FQL).
  • Our main idea is to train a separate, expressive one-step policy with RL, while training the flow policy only with behavioral cloning. We train the one-step policy to maximize Q-values while regularizing it with distillation from the BC flow policy. We call this technique one-step guidance.
  • This way, we can completely avoid tricky problems with guiding iterative generative models (e.g., recursive backpropagation), because we're only doing RL with the one-step policy!
  • Moreover, the output of this procedure is the efficient one-step policy, which doesn't involve any iterative computation at test time.
  • But, is the one-step policy expressive enough to capture complex action distributions? The answer is yes! Even for image generation, many previous works have shown that distilled one-step models are still capable of generating high-quality samples (e.g., rectified flow, shortcut models).

Algorithm

  • We now describe FQL's objectives in full. FQL has three components: the critic \(Q_\phi(s, a)\), the BC flow policy \(\mu_\theta(s, z)\), and the one-step policy \(\mu_\omega(s, z)\).
  • The critic \(Q_\phi(s, a)\) is trained with the standard Bellman loss, as usual.
  • The BC flow policy \(\mu_\theta(s, z)\) is trained only with the BC flow-matching loss.
  • The one-step policy \(\mu_\omega(s, z)\) is trained to maximize values while distilling the BC flow policy into it, using the following loss: $$\begin{aligned} \underbrace{\mathbb{E}_{s \sim \mathcal{D}, a^\pi \sim \pi_\omega}[-Q_\phi(s, a^\pi)]}_{\texttt{Q loss}} + \alpha \cdot \underbrace{\mathbb{E}_{s \sim \mathcal{D}, z \sim \mathcal{N}(0, I_d)} \left[\|\mu_\omega(s, z) - \mu_\theta(s, z)\|_2^2\right]}_{\texttt{Distillation loss}}, \end{aligned}$$ where \(\alpha\) is a hyperparameter that balances the two losses, and \(\pi_\omega\) denotes the stochastic policy induced by the deterministic map \(\mu_\omega(s, z)\). (A code sketch of this loss follows this list.)
  • The output of FQL is the one-step policy \(\mu_\omega(s, z)\).
  • Note that the Q loss doesn't involve recursive backpropagation, thanks to the one-step policy!
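
The one-step policy loss above translates into just a few lines of code. Below is a minimal JAX sketch under assumed names: `mu_omega` (the one-step policy network), `q_fn` (the critic with its parameters bound), and `bc_flow_action` (the BC flow policy evaluated by integrating its velocity field from noise). This is a sketch, not the official implementation.

```python
# Minimal sketch of FQL's one-step policy loss: Q loss + alpha * distillation loss.
import jax
import jax.numpy as jnp

def one_step_policy_loss(omega_params, mu_omega, q_fn, bc_flow_action,
                         key, s, alpha, action_dim):
    key_q, key_d = jax.random.split(key)
    # Q loss: a^pi ~ pi_omega, i.e., mu_omega applied to fresh Gaussian noise.
    z_q = jax.random.normal(key_q, (s.shape[0], action_dim))
    a_pi = mu_omega(omega_params, s, z_q)
    q_loss = -jnp.mean(q_fn(s, a_pi))                      # maximize Q by minimizing -Q
    # Distillation loss: match the BC flow policy's output on shared noise z.
    z_d = jax.random.normal(key_d, (s.shape[0], action_dim))
    a_bc = jax.lax.stop_gradient(bc_flow_action(s, z_d))   # target from the BC flow policy
    a_omega = mu_omega(omega_params, s, z_d)
    distill_loss = jnp.mean(jnp.sum((a_omega - a_bc) ** 2, axis=-1))
    return q_loss + alpha * distill_loss
```

Gradients flow only through \(\mu_\omega\) here; the critic and the BC flow policy are held fixed in this loss, which is exactly what avoids backpropagating through the flow's integration steps.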

Experiments

Tasks

Task domains: antmaze, humanoidmaze, antsoccer, cube, scene, puzzle.

  • We use the recently proposed OGBench task suite as the main benchmark. OGBench provides a number of diverse, challenging tasks across robotic locomotion and manipulation, with both state and pixel observations; we use 50 of its tasks.
  • Additionally, we employ 6 antmaze and 12 adroit tasks from the D4RL benchmark.

Offline RL

Offline RL performance aggregated across 50 OGBench tasks.
  • FQL achieves the best offline RL performance among ten methods on most of the 73 challenging tasks (averaged over 8 seeds), including whole-body humanoid control, object manipulation, and pixel-based control!
  • In particular, FQL achieves substantially better performance than Gaussian policy-based methods on manipulation tasks with highly multimodal action distributions. Also, FQL's one-step guidance significantly outperforms previous policy extraction techniques (e.g., weighted regression in FAWAC, recursive backpropagation in FBRAC, and rejection sampling in IFQL).
  • On the previous D4RL benchmark, FQL achieves the best performance (84%) on one of the hardest tasks, antmaze-large-play.

Offline-to-online RL

Offline-to-online RL training curves averaged over 8 seeds. Online fine-tuning starts at 1M steps.
  • Another benefit of FQL is that it can be directly fine-tuned with online rollouts without any modifications.
  • FQL achieves the best performance among six strong offline-to-online RL methods on most of the 15 tasks!

How fast is FQL?

Run time comparison.
  • FQL is one of the fastest flow-based offline RL methods.
  • Thanks to our one-step policy, FQL is only slightly slower than Gaussian policy-based methods, and faster than most flow-based baselines.

Do I need to tune flow-related hyperparameters?

Ablation study on the number of flow steps.

Ablation study on the flow time distribution.
  • No, in general!
  • You can simply use the default hyperparameters (10 flow steps and uniform time sampling); the performance is generally robust to these choices. (See the sketch below for where the number of flow steps enters.)
  • However, you do need to tune the BC coefficient \(\alpha\), as is typical for most offline RL methods. See the paper for more ablation studies and discussions.
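
For reference, the number of flow steps only enters where the BC flow policy turns noise into an action (e.g., when producing the distillation target). Here is a minimal sketch of that Euler integration with the default 10 steps, using illustrative names rather than the official implementation.

```python
# Minimal sketch of action generation with the BC flow policy:
# Euler integration of the learned velocity field from x^0 = z toward x^1.
import jax.numpy as jnp

def bc_flow_action(theta_params, velocity_field, s, z, num_flow_steps=10):
    x = z                                 # start from x^0 = z ~ N(0, I_d)
    dt = 1.0 / num_flow_steps
    for i in range(num_flow_steps):
        t = jnp.full((s.shape[0], 1), i * dt)
        x = x + dt * velocity_field(theta_params, t, s, x)  # one Euler step
    return x                              # approximate sample from the BC flow policy
```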

Citation

@article{fql_park2025,
  title={Flow Q-Learning},
  author={Seohong Park and Qiyang Li and Sergey Levine},
  journal={ArXiv},
  year={2025}
}
