Flow Q-Learning

Overview

  • Flow Q-learning (FQL) is a simple and effective method that trains an expressive flow-matching policy for data-driven reinforcement learning (RL), including offline RL and offline-to-online RL.
  • FQL is simple! Thanks to the simplicity of flow matching, FQL can be implemented in a few lines of code on top of an existing actor-critic framework, without requiring any tuning of noise or sampling strategies.
  • FQL is scalable and performant. FQL's expressive flow policy can handle data with arbitrarily complex, multimodal action distributions, achieving strong performance on a variety of challenging robotic locomotion and manipulation tasks.

Challenge

  • Training a flow policy with behavioral cloning (BC) is straightforward. We can simply train a velocity field \(v_\theta(t, s, x): [0, 1] \times \mathcal{S} \times \mathbb{R}^d \to \mathbb{R}^d\) with the standard flow-matching loss (see this excellent tutorial if you're not familiar with flow matching!): $$\begin{aligned} \mathbb{E}_{\substack{s, a=x^1 \sim \mathcal{D}, \\ x^0 \sim \mathcal{N}(0, I_d), \\ t \sim \mathrm{Unif}([0, 1])}} \left[\|v_\theta(t, s, x^t) - (x^1 - x^0)\|_2^2\right], \end{aligned}$$ where \(x^t = (1 - t)x^0 + tx^1\) is a linear interpolation between the two samples. Note that flow matching happens in the action space \(\mathcal{A} = \mathbb{R}^d\). (A code sketch of this loss follows this list.)
  • The resulting velocity field \(v_\theta\) will generate a flow that transforms the unit Gaussian into the behavioral action distribution, \(\pi^\beta(a \mid s)\).
  • But, how can we train a flow policy with reinforcement learning to maximize rewards? Our goal is, given a dataset \(\mathcal{D}\) of transitions \((s, a, r, s')\), to train a flow policy that maximizes the expected return while not deviating too much from the dataset.
  • This is a highly non-trivial problem because flow policies are iterative. Naïvely maximizing values with a flow policy will require recursive backpropagation through the flow, which often leads to unstable training and suboptimal performance.
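
For concreteness, here is a minimal JAX sketch of the flow-matching BC loss above. The function and parameter names (e.g., `velocity_field`, `params`) are illustrative assumptions, not the official FQL implementation.

```python
# Minimal sketch of the BC flow-matching loss, assuming a network
# velocity_field(params, t, s, x) that maps (time, state, noisy action)
# to a velocity in R^d.
import jax
import jax.numpy as jnp

def bc_flow_matching_loss(params, velocity_field, key, s, a):
    """s: states of shape (B, state_dim); a: dataset actions x^1 of shape (B, d)."""
    key_x0, key_t = jax.random.split(key)
    x0 = jax.random.normal(key_x0, a.shape)          # x^0 ~ N(0, I_d)
    t = jax.random.uniform(key_t, (a.shape[0], 1))   # t ~ Unif([0, 1])
    xt = (1.0 - t) * x0 + t * a                      # x^t = (1 - t) x^0 + t x^1
    v = velocity_field(params, t, s, xt)             # predicted velocity
    return jnp.mean(jnp.sum((v - (a - x0)) ** 2, axis=-1))
```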

Idea

  • We propose a new, simple, and effective mechanism to train a flow policy with RL, which we call flow Q-learning (FQL).
  • Our main idea is to train a separate, expressive one-step policy with RL, while training the flow policy only with behavioral cloning. We train the one-step policy to maximize Q-values while regularizing it with distillation from the BC flow policy. We call this technique one-step guidance.
  • This way, we can completely avoid tricky problems with guiding iterative generative models (e.g., recursive backpropagation), because we're only doing RL with the one-step policy!
  • Moreover, the output of this procedure is the efficient one-step policy, which doesn't involve any iterative computation at test time.
  • But, is the one-step policy expressive enough to capture complex action distributions? The answer is yes! Even for image generation, many previous works have shown that distilled one-step models are still capable of generating high-quality samples (e.g., rectified flow, shortcut models).

Algorithm

  • We now describe FQL's objectives in full. FQL has three components: the critic \(Q_\phi(s, a)\), the BC flow policy \(\mu_\theta(s, z)\), and the one-step policy \(\mu_\omega(s, z)\).
  • The critic \(Q_\phi(s, a)\) is trained with the standard Bellman loss, as usual.
  • The BC flow policy \(\mu_\theta(s, z)\) is trained only with the BC flow-matching loss.
  • The one-step policy \(\mu_\omega(s, z)\) is trained to maximize values while distilling the BC flow policy into it, using the following loss: $$\begin{aligned} \underbrace{\mathbb{E}_{s \sim \mathcal{D}, a^\pi \sim \pi_\omega}[-Q_\phi(s, a^\pi)]}_{\texttt{Q loss}} + \alpha \cdot \underbrace{\mathbb{E}_{s \sim \mathcal{D}, z \sim \mathcal{N}(0, I_d)} \left[\|\mu_\omega(s, z) - \mu_\theta(s, z)\|_2^2\right]}_{\texttt{Distillation loss}}, \end{aligned}$$ where \(\alpha\) is a hyperparameter that balances the two losses, and \(\pi_\omega\) denotes the stochastic policy induced by the deterministic map \(\mu_\omega(s, z)\). (A code sketch of this loss follows this list.)
  • The output of FQL is the one-step policy \(\mu_\omega(s, z)\).
  • Note that the Q loss doesn't involve recursive backpropagation, thanks to the one-step policy!
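
The one-step policy loss above translates into just a few lines of code. Below is a minimal JAX sketch under assumed names: `mu_omega` (the one-step policy network), `q_fn` (the critic with its parameters bound), and `bc_flow_action` (the BC flow policy evaluated by integrating its velocity field from noise). This is a sketch, not the official implementation.

```python
# Minimal sketch of FQL's one-step policy loss: Q loss + alpha * distillation loss.
import jax
import jax.numpy as jnp

def one_step_policy_loss(omega_params, mu_omega, q_fn, bc_flow_action,
                         key, s, alpha, action_dim):
    key_q, key_d = jax.random.split(key)
    # Q loss: a^pi ~ pi_omega, i.e., mu_omega applied to fresh Gaussian noise.
    z_q = jax.random.normal(key_q, (s.shape[0], action_dim))
    a_pi = mu_omega(omega_params, s, z_q)
    q_loss = -jnp.mean(q_fn(s, a_pi))                      # maximize Q by minimizing -Q
    # Distillation loss: match the BC flow policy's output on shared noise z.
    z_d = jax.random.normal(key_d, (s.shape[0], action_dim))
    a_bc = jax.lax.stop_gradient(bc_flow_action(s, z_d))   # target from the BC flow policy
    a_omega = mu_omega(omega_params, s, z_d)
    distill_loss = jnp.mean(jnp.sum((a_omega - a_bc) ** 2, axis=-1))
    return q_loss + alpha * distill_loss
```

Gradients flow only through \(\mu_\omega\) here; the critic and the BC flow policy are held fixed in this loss, which is exactly what avoids backpropagating through the flow's integration steps.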

Experiments

Tasks

Task domains: antmaze, humanoidmaze, antsoccer, cube, scene, puzzle.

  • We use the recently proposed OGBench task suite as the main benchmark. OGBench provides a number of diverse, challenging tasks across robotic locomotion and manipulation, with both state and pixel observations; we use 50 of its tasks.
  • Additionally, we employ 6 antmaze and 12 adroit tasks from the D4RL benchmark.

Offline RL

Offline RL performance aggregated across 50 OGBench tasks.
  • FQL achieves the best offline RL performance among ten methods on most of the 73 challenging tasks (averaged over 8 seeds), including whole-body humanoid control, object manipulation, and pixel-based control!
  • In particular, FQL achieves substantially better performance than Gaussian policy-based methods on manipulation tasks with highly multimodal action distributions. Also, FQL's one-step guidance significantly outperforms previous policy extraction techniques (e.g., weighted regression in FAWAC, recursive backpropagation in FBRAC, and rejection sampling in IFQL).
  • On the previous D4RL benchmark, FQL achieves the best performance (84%) on one of the hardest tasks, antmaze-large-play.

Offline-to-online RL

Offline-to-online RL training curves averaged over 8 seeds. Online fine-tuning starts at 1M steps.
  • Another benefit of FQL is that it can be directly fine-tuned with online rollouts without any modifications.
  • FQL achieves the best performance among six strong offline-to-online RL methods on most of the 15 tasks!

How fast is FQL?

Run time comparison.
  • FQL is one of the fastest flow-based offline RL methods.
  • Thanks to our one-step policy, FQL is only slightly slower than Gaussian policy-based methods, and faster than most flow-based baselines.

Do I need to tune flow-related hyperparameters?

Ablation study on the number of flow steps.

Ablation study on the flow time distribution.
  • No, in general!
  • You can simply use the default hyperparameters (10 flow steps and uniform time sampling); the performance is generally robust to these choices. (See the sketch below for where the number of flow steps enters.)
  • However, you do need to tune the BC coefficient \(\alpha\), as is typical for most offline RL methods. See the paper for more ablation studies and discussions.
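
For reference, the number of flow steps only enters where the BC flow policy turns noise into an action (e.g., when producing the distillation target). Here is a minimal sketch of that Euler integration with the default 10 steps, using illustrative names rather than the official implementation.

```python
# Minimal sketch of action generation with the BC flow policy:
# Euler integration of the learned velocity field from x^0 = z toward x^1.
import jax.numpy as jnp

def bc_flow_action(theta_params, velocity_field, s, z, num_flow_steps=10):
    x = z                                 # start from x^0 = z ~ N(0, I_d)
    dt = 1.0 / num_flow_steps
    for i in range(num_flow_steps):
        t = jnp.full((s.shape[0], 1), i * dt)
        x = x + dt * velocity_field(theta_params, t, s, x)  # one Euler step
    return x                              # approximate sample from the BC flow policy
```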

Citation

@article{fql_park2025,
  title={Flow Q-Learning},
  author={Seohong Park and Qiyang Li and Sergey Levine},
  journal={ArXiv},
  year={2025}
}
