Flow Q-Learning
Seohong Park, Qiyang Li, Sergey Levine
UC Berkeley
Overview
- Flow Q-learning (FQL) is a simple and effective method that trains an expressive flow-matching policy for data-driven reinforcement learning (RL), including offline RL and offline-to-online RL.
- FQL is simple! Thanks to the simplicity of flow matching, FQL can be implemented in a few lines on top of an existing actor-critic framework, without requiring tuning of noise or sampling strategies.
- FQL is scalable and performant. FQL's expressive flow policy can handle data with arbitrarily complex, multimodal action distributions, achieving strong performance on a variety of challenging robotic locomotion and manipulation tasks.
Challenge
- Training a flow policy with behavioral cloning (BC) is straightforward. We can simply train a velocity field \(v_\theta(t, s, x): [0, 1] \times \mathcal{S} \times \mathbb{R}^d \to \mathbb{R}^d\) with the standard flow-matching loss (see this excellent tutorial if you're not familiar with flow matching!): $$\begin{aligned} \mathbb{E}_{\substack{s, a=x^1 \sim \mathcal{D}, \\ x^0 \sim \mathcal{N}(0, I_d), \\ t \sim \mathrm{Unif}([0, 1])}} \left[\|v_\theta(t, s, x^t) - (x^1 - x^0)\|_2^2\right], \end{aligned}$$ where \(x^t = (1 - t)x^0 + tx^1\) is a linear interpolation between two samples. Note that flow matching happens in the action space \(\mathcal{A} = \mathbb{R}^d\). (A minimal code sketch of this loss appears after this list.)
- The resulting velocity field \(v_\theta\) will generate a flow that transforms the unit Gaussian into the behavioral action distribution, \(\pi^\beta(a \mid s)\).
- But, how can we train a flow policy with reinforcement learning to maximize rewards? Our goal is, given a dataset \(\mathcal{D}\) of transitions \((s, a, r, s')\), to train a flow policy that maximizes the expected return while not deviating too much from the dataset.
- This is a highly non-trivial problem because flow policies are iterative. Naïvely maximizing values with a flow policy will require recursive backpropagation through the flow, which often leads to unstable training and suboptimal performance.
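- To make this concrete, here is a minimal PyTorch-style sketch of the flow-matching BC loss from the first bullet above. The `VelocityField` architecture, batch format, and names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the flow-matching BC loss (illustrative; not the official implementation).
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """v_theta(t, s, x): predicts the velocity (x^1 - x^0) given time t, state s, and noisy action x."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, t: torch.Tensor, s: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([t, s, x], dim=-1))


def flow_matching_bc_loss(v_theta: VelocityField, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Standard flow-matching loss with linear interpolation, treating dataset actions as x^1."""
    x1 = a                                            # x^1 = a from the dataset
    x0 = torch.randn_like(x1)                         # x^0 ~ N(0, I_d)
    t = torch.rand(x1.shape[0], 1, device=x1.device)  # t ~ Unif([0, 1])
    xt = (1.0 - t) * x0 + t * x1                      # x^t: linear interpolation
    target = x1 - x0                                  # ground-truth velocity
    return ((v_theta(t, s, xt) - target) ** 2).sum(dim=-1).mean()
```

Minimizing this loss over mini-batches of \((s, a)\) from \(\mathcal{D}\) trains a velocity field whose flow reproduces the behavioral action distribution.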
Idea
- We propose a new, simple, and effective mechanism to train a flow policy with RL, which we call flow Q-learning (FQL).
- Our main idea is to train a separate, expressive one-step policy with RL, while training the flow policy only with behavioral cloning. We train the one-step policy to maximize Q-values while regularizing it with distillation from the BC flow policy. We call this technique one-step guidance.
- This way, we can completely avoid tricky problems with guiding iterative generative models (e.g., recursive backpropagation), because we're only doing RL with the one-step policy!
- Moreover, the output of this procedure is an efficient one-step policy, which doesn't involve any iterative computation at test time (see the sketch after this list).
- But, is the one-step policy expressive enough to capture complex action distributions? The answer is yes! Even for image generation, many previous works have shown that distilled one-step models are still capable of generating high-quality samples (e.g., rectified flow, shortcut models).
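- The difference is easy to see in code. Below is a hedged sketch contrasting the two at inference time: sampling from the flow policy requires an iterative rollout (here, Euler integration of the learned velocity field; the integrator and step count are assumed details), whereas the one-step policy maps a single noise sample to an action in one forward pass.

```python
# Illustrative contrast between iterative flow-policy sampling and one-step sampling.
import torch


def sample_flow_action(v_theta, s: torch.Tensor, action_dim: int, num_steps: int = 10) -> torch.Tensor:
    """Iterative: Euler-integrate the learned velocity field from t=0 to t=1, starting from noise."""
    x = torch.randn(s.shape[0], action_dim, device=s.device)  # x^0 ~ N(0, I_d)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((s.shape[0], 1), i * dt, device=s.device)
        x = x + dt * v_theta(t, s, x)                          # one Euler step along v_theta
    return x                                                   # approximately ~ pi^beta(a | s)


def sample_one_step_action(mu_omega, s: torch.Tensor, action_dim: int) -> torch.Tensor:
    """One-step: a single network evaluation, with no integration loop."""
    z = torch.randn(s.shape[0], action_dim, device=s.device)   # z ~ N(0, I_d)
    return mu_omega(s, z)
```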
Algorithm
- We now describe FQL's objectives in full. FQL has three components: a critic \(Q_\phi(s, a)\), a BC flow policy \(\mu_\theta(s, z)\), and a one-step policy \(\mu_\omega(s, z)\).
- The critic \(Q_\phi(s, a)\) is trained with the standard Bellman loss, as usual.
- The BC flow policy \(\mu_\theta(s, z)\) is trained only with the flow-matching BC loss above.
- The one-step policy \(\mu_\omega(s, z)\) is trained to maximize values while being regularized by distillation from the BC flow policy, using the following loss (a code sketch follows this list): $$\begin{aligned} \underbrace{\mathbb{E}_{s \sim \mathcal{D}, a^\pi \sim \pi_\omega}[-Q_\phi(s, a^\pi)]}_{\texttt{Q loss}} + \alpha \cdot \underbrace{\mathbb{E}_{s \sim \mathcal{D}, z \sim \mathcal{N}(0, I_d)} \left[\|\mu_\omega(s, z) - \mu_\theta(s, z)\|_2^2\right]}_{\texttt{Distillation loss}}, \end{aligned}$$ where \(\alpha\) is a hyperparameter that balances the two losses, and \(\pi_\omega\) denotes the stochastic policy induced by the deterministic map \(\mu_\omega(s, z)\) with \(z \sim \mathcal{N}(0, I_d)\).
- The output of FQL is the one-step policy \(\mu_\omega(s, z)\).
- Note that the Q loss doesn't involve recursive backpropagation, thanks to the one-step policy!
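- Putting the pieces together, here is a minimal PyTorch-style sketch of the one-step policy loss above. The interfaces are assumptions: `q_phi(s, a)` returns Q-values, `mu_omega(s, z)` is the one-step policy network, and `bc_flow_policy(s, z)` evaluates \(\mu_\theta(s, z)\), e.g., by Euler-integrating the BC velocity field from \(z\) as in the earlier sampling sketch.

```python
# Illustrative sketch of FQL's one-step policy loss (not the official implementation).
import torch


def one_step_policy_loss(mu_omega, bc_flow_policy, q_phi,
                         s: torch.Tensor, action_dim: int, alpha: float) -> torch.Tensor:
    """Q loss + alpha * distillation loss for the one-step policy mu_omega(s, z).

    bc_flow_policy(s, z) should return mu_theta(s, z), e.g., obtained by Euler-integrating
    the BC velocity field starting from z (see the sampling sketch above).
    """
    batch_size = s.shape[0]

    # Q loss: a^pi ~ pi_omega is obtained by pushing Gaussian noise through mu_omega.
    z_q = torch.randn(batch_size, action_dim, device=s.device)
    a_pi = mu_omega(s, z_q)
    q_loss = -q_phi(s, a_pi).mean()          # maximize Q; gradients flow only through mu_omega

    # Distillation loss: match the BC flow policy's output on the same noise sample z.
    z = torch.randn(batch_size, action_dim, device=s.device)
    with torch.no_grad():                    # in this sketch, the BC flow output is a fixed target
        a_bc = bc_flow_policy(s, z)
    distill_loss = ((mu_omega(s, z) - a_bc) ** 2).sum(dim=-1).mean()

    return q_loss + alpha * distill_loss
```

Neither term backpropagates through the flow's integration steps: the Q loss differentiates only the one-step network, and the distillation target is treated as a constant in this sketch.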
Experiments
Tasks
[Task environments shown: antmaze, humanoidmaze, antsoccer, cube, scene, puzzle]
- We use the recently proposed OGBench task suite as the main benchmark. OGBench provides a number of diverse, challenging tasks across robotic locomotion and manipulation, with both state and pixel observations. We use 50 tasks from OGBench.
- Additionally, we employ 6 antmaze and 12 adroit tasks from the D4RL benchmark.
Offline RL
- FQL achieves the best offline RL performance among ten methods on most of the 73 challenging tasks (averaged over 8 seeds), including whole-body humanoid control, object manipulation, and pixel-based control!
- In particular, FQL achieves substantially better performance than Gaussian policy-based methods on manipulation tasks with highly multimodal action distributions. Also, FQL's one-step guidance significantly outperforms previous policy extraction techniques (e.g., weighted regression in FAWAC, recursive backpropagation in FBRAC, and rejection sampling in IFQL).
- On the earlier D4RL benchmark, FQL achieves the best performance (84%) on one of the hardest tasks, antmaze-large-play.
Offline-to-online RL
- Another benefit of FQL is that it can be directly fine-tuned with online rollouts without any modifications.
- FQL achieves the best performance among six performant offline-to-online RL methods on most of the 15 tasks!
How fast is FQL?
- FQL is one of the fastest flow-based offline RL methods.
- Thanks to our one-step policy, FQL is only slightly slower than Gaussian policy-based methods, and faster than most flow-based baselines.
Do I need to tune flow-related hyperparameters?
- No, in general!
- You can simply use the default hyperparameters (10 flow steps and uniform sampling), and the performance is generally robust to these choices.
- However, you do need to tune the BC coefficient \(\alpha\), as is typical for most offline RL methods. See the paper for more ablation studies and discussions.
Citation
@article{fql_park2025,
  title={Flow Q-Learning},
  author={Seohong Park and Qiyang Li and Sergey Levine},
  journal={ArXiv},
  year={2025}
}