Is Value Learning Really the Main Bottleneck in Offline RL?

Seohong Park
UC Berkeley 
Kevin Frans
UC Berkeley 
Sergey Levine
UC Berkeley 
Aviral Kumar
Google DeepMind
Motivation
 The main difference between offline reinforcement learning (RL) and imitation learning is the use of a value function, and most prior works in offline RL focus on learning better value functions with better techniques.
 So value function learning is the main bottleneck in offline RL... right?
 In this work, we show that this is often not the case in practice!
Analyzing the bottlenecks of offline RL
 Our main goal in this work is to understand how the performance of offline RL can be bottlenecked in practice.
 There are three potential bottlenecks in offline RL:
 (B1) Imperfect value estimation from data
 (B2) Imperfect policy extraction from the learned value function
 (B3) Imperfect policy generalization to testtime states that the policy visits during evaluation
 Note that the bottleneck of an offline RL algorithm can always be attributed to one or some of these factors, given that the agent will attain optimal performance if all of them are perfect.
 So all we need is to dissect these components individually!
Main hypothesis

Let's first state our main research hypothesis in this work:
"The main bottleneck in offline RL is policy learning, not value learning."
 In other words, (though of course value learning is important,) we argue that how the policy is extracted from the value function (B2) and how well the policy generalizes to states that it visits at evaluation time (B3) are often the main factors that significantly affect both performance and scalability in many problems.
 To verify this hypothesis, we conduct two main analyses in this work: in the first analysis, we compare value learning and policy extraction (B1 and B2), and in the second analysis, we analyze the effect of policy generalization (B3).
TL;DR

Here's a tl;dr of our analyses for those who want it 🙂:
(1) Policy extraction is often more important than value learning: Do not use weighted behavior cloning (AWR); always use behaviorconstrained policy gradient (DDPG+BC).
(2) Testtime policy generalization is one of the most significant bottlenecks in offline RL: Current offline RL is often already great at learning an effective policy on dataset states, and the performance is often simply determined by its performance on outofdistribution states.
Analysis 1: Is it the value or the policy? (B1 and B2)
 What's more important: value learning or policy extraction?
 To answer this question, we run different algorithms with different amounts of data for value function training and policy extraction, and draw datascaling matrices to visualize the results.
 This will tell us (1) whether the performance is bottlenecked by the value or the policy and (2) the performance difference between various value learning and policy extraction algorithms.
 To clearly dissect value learning and policy extraction, we focus on offline RL methods with decoupled value and policy learning phases in this work (e.g., IQL, onestep RL, CRL, etc.). This might sound a bit restrictive, but we will show that policy learning is often the main bottleneck even in these decoupled methods!
Analysis setup
 We consider the following algorithms and environments in this analysis.
 Three value learning algorithms: (1) Implicit Qlearning (IQL), (2) Onestep RL (SARSA), (3) Contrastive RL (CRL).

Three policy extraction algorithms:
 (1) Weighted behavioral cloning (e.g., AWR, RWR, AWAC, etc.): $$\begin{aligned} \max_\pi \ \mathcal{J}_\mathrm{AWR}(\pi) = \mathbb{E}_{s, a \sim \mathcal{D}}[e^{\alpha (Q(s, a)  V(s))} \log \pi(a \mid s)], \end{aligned}$$ $$ \newcommand{\rphantom}[1]{\mathrel{\phantom{#1}}} \begin{aligned} &\rphantom{=} \max_\pi \ \mathcal{J}_\mathrm{AWR}(\pi) \\ &= \mathbb{E}_{s, a \sim \mathcal{D}}[e^{\alpha (Q(s, a)  V(s))} \log \pi(a \mid s)], \end{aligned} $$ where \(\alpha\) is an (inverse) temperature parameter.
 (2) Behaviorconstrained policy gradient (e.g., DDPG+BC, TD3+BC, etc.): $$\begin{aligned} \max_\pi \ \mathcal{J}_\mathrm{DDPG+BC}(\pi) = \mathbb{E}_{s, a \sim \mathcal{D}}[Q(s, \mu^\pi(s)) + \alpha \log \pi(a \mid s)], \end{aligned}$$ $$\begin{aligned} &\rphantom{=} \max_\pi \ \mathcal{J}_\mathrm{DDPG+BC}(\pi) \\ &= \mathbb{E}_{s, a \sim \mathcal{D}}[Q(s, \mu^\pi(s)) + \alpha \log \pi(a \mid s)], \end{aligned}$$ where \(\mu^\pi(s)\) is the mean of the policy, and \(\alpha\) is a hyperparameter that controls the strength of the BC regularizer.
 (3) Samplingbased action selection (e.g., SfBC, BCQ, IDQL, etc.): $$ \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \begin{aligned} \pi(s) = \argmax_{a \in \{a_1, \ldots, a_N\}}[Q(s, a)], \end{aligned} $$ where \(a_1, \ldots, a_N\) are sampled from the learned BC policy \(\pi^\beta(\cdot \mid s)\).
 Eight tasks (including goalconditioned ("gc") and pixelbased ones!): (1) gcantmazelarge, (2) antmazelarge, (3) d4rlhopper, (4) d4rlwalker2d, (5) exorlwalker, (6) exorlcheetah, (7) kitchen, (8) (pixelbased) gcroboverse.
Results
 The plots above show the datascaling matrices. We highlight the results on exorlwalker and exorlcheetah. Click the above button to see the full datascaling matrices on eight tasks, aggregated from a total of 7744 runs. We individually tuned policy extraction hyperparameters for each matrix entry.
 By looking at color gradients, we can see how the performance of each algorithm scales with more data and/or how it is bottlenecked. Below, we highlight some key observations.
 First, we can see that policy extraction algorithms often affect the performance and datascaling trends much more than value learning objectives in general (perhaps except antmazelarge), even though they extract policies from the same value function!
 Among policy extraction algorithms, we find that DDPG+BC almost always achieves the best performance and scaling behaviors across the board, followed by SfBC, and the performance of AWR falls significantly behind the other two in many cases.
 We can also see that the datascaling matrices of AWR always have vertical or diagonal color gradients, which indicates that it does not fully utilize the value function (see below for clearer evidence).
 Read the paper for further discussions!
Deep dive: Why is DDPG+BC often much better than AWR?
 Although AWR and DDPG+BC are theoretically equivalent in some sense, we saw that DDPG+BC often leads to significantly better performances. Why is this the case?
 The main difference between AWR and DDPG+BC is in their objective: $$\begin{aligned} \mathcal{J}_\mathrm{AWR}(\pi) &= \mathbb{E}_{s, a \sim \mathcal{D}}[\underbrace{e^{\alpha (Q(s, a)  V(s))} \log \pi(a \mid s)}_{\text{Modecovering}}], \\ \mathcal{J}_\mathrm{DDPG+BC}(\pi) &= \mathbb{E}_{s, a \sim \mathcal{D}}[\underbrace{Q(s, \mu^\pi(s))}_{\text{Modeseeking}} + \underbrace{\alpha \log \pi(a \mid s)}_{\text{Modecovering}}]. \end{aligned}$$ $$\begin{aligned} &\rphantom{=} \mathcal{J}_\mathrm{AWR}(\pi) \\ &= \mathbb{E}_{s, a \sim \mathcal{D}}[\underbrace{e^{\alpha (Q(s, a)  V(s))} \log \pi(a \mid s)}_{\text{Modecovering}}], \\ &\rphantom{=} \mathcal{J}_\mathrm{DDPG+BC}(\pi) \\ &= \mathbb{E}_{s, a \sim \mathcal{D}}[\underbrace{Q(s, \mu^\pi(s))}_{\text{Modeseeking}} + \underbrace{\alpha \log \pi(a \mid s)}_{\text{Modecovering}}]. \end{aligned}$$
 The key difference is that, AWR only has a modecovering weighted behavioral cloning term, but DDPG+BC has both modeseeking firstorder value maximization and modecovering behavioral cloning terms. In other words, DDPG+BC has two modes, but AWR has only one!
 To illustrate this, we draw datascaling matrices of AWR and DDPG+BC with different temperature values (\(\alpha\)) on gcantmazelarge. Note that \(\alpha = 0\) corresponds to BC in AWR and \(\alpha = \infty\) corresponds to BC in DDPG+BC.
 The results clearly show a difference between AWR and DDPG+BC. That is, AWR is always policybounded regardless of the BC strength \(\alpha\) (i.e., color gradients are always vertical), whereas DDPG+BC has two modes: it is policybounded when \(\alpha\) is large, and valuebounded and when \(\alpha\) is small.
 Very interestingly, an inbetween value of \(\alpha = 1.0\) in DDPG+BC leads to the best of both worlds, significantly improving performance across the entire matrix (especially look at the bottom left corner with 0.1M datasets)!

In the paper, we further dig into the difference between AWR and DDPG+BC from several different perspectives.
Just to make you curious, we put two figures above without descriptions. 😄
Takeaway: Do not use weighted behavior cloning (AWR); always use behaviorconstrained policy gradient (DDPG+BC). This enables better scaling of performance with more data and better use of the value function.
Analysis 2: Policy generalization (B3)
 Now, we turn our focus to the third, unique bottleneck in offline RL: policy generalization.
 In offline RL, the agent encounters new, potentially outofdistribution states at test time, and how well it generalizes to such novel states can directly affect performance.
Analysis setup

To understand this generalization bottleneck, we first define three key metrics quantifying policy accuracy:
 Intuitively, these metrics measure how accurate the policy is on three different distributions.
 Perhaps you're already familiar with training MSE and validation MSE. But we have another metric: evaluation MSE, which might look similar to validation MSE but is actually very different.
 The key difference between validation MSE and evaluation MSE is that validation MSE measures indistribution policy accuracy, but evaluation MSE measures outofdistribution policy accuracy (see the figure above). Evaluation MSE precisely corresponds to the generalization bottleneck that we're going to measure.
 To see how these quantities correlate with performance, we observe how these metrics and performance evolve with additional online interaction data (commonly referred to as the offlinetoonline RL setting).
Results
 The plots above show how returns and MSE metrics improve with additional online interaction data. We denote online training steps in red.
 The results are quite surprising! We can see that (1) offlinetoonline RL mostly only improves evaluation MSEs, while validation MSEs and training MSEs often remain completely flat, and (2) the performance of offline RL is very strongly (inversely) correlated with the evaluation MSE metric.
 What does this mean? This means that current offline RL algorithms may already be sufficiently great at learning the best possible policy within the distribution of states covered by the offline dataset, and the performance is often simply determined by the policy accuracy on novel states that the agent encounters at test time!
 This provides a new perspective on generalization in offline RL, which is somewhat very different from the previous focus on pessimism and behavioral regularization.
 Then, how can we improve testtime policy generalization? Unfortunately, this is very difficult in principle, because it requires generalization to a potentially completely different distribution. Nonetheless, we can address this issue if we slightly relax assumptions, and we propose two such solutions in this work.
Solution 1: Improve offline data coverage
 Our first solution is to use a highcoverage dataset.
 The rationale is very simple: if testtime outofdistribution generalization is the bottleneck, we can simply make testtime states indistribution with more data (of course, when it's possible to collect more data 🙂)!
 In the plots above, we show that highcoverage datasets indeed improve performance, despite their increased suboptimality. Also, note that the use of a right policy extraction objective (DDPG+BC) is important in this case as well!
Solution 2: Testtime policy improvement
 If we don't have control over data, another way to improve testtime policy accuracy is to onthefly train or steer the policy on testtime states.
 To do this, we propose a really simple method in the paper, which we call onthefly policy extraction (OPEX). Our key idea is to simply adjust policy actions in the direction of the value gradient at evaluation time.
 Specifically, after sampling an action from the policy \(a \sim \pi(·  s)\) at test time, we further adjust the action based on the frozen learned Q function with the following formula. $$\begin{align} a \gets a + \beta \cdot \nabla_a Q(s, a), \end{align}$$ where \(\beta\) is a hyperparameter that corresponds to the testtime "learning rate".
 OPEX requires only a single line of additional code at evaluation and does not change the training procedure at all!
 In the paper, we propose another method named testtime training (TTT), which further updates policy parameters during testtime rollouts.

We show that these onthefly policy improvement techniques improve performance on diverse tasks, by mitigating the testtime policy generalization bottleneck.
Takeaway: Testtime policy generalization is one of the most significant bottlenecks in offline RL. Use highcoverage datasets. Improve policy accuracy on testtime states with onthefly policy improvement techniques.
So, what does our analysis tell us?
 We showed that, somewhat contrary to the prior belief that value learning is the main bottleneck of offline RL, current offline RL methods are often heavily bottlenecked by how faithfully the policy is extracted from the value function and how well this policy generalizes to testtime states.
 For practitioners, our analysis indicates a clear recipe for offline RL: train a value function on as diverse data as possible, and allow the policy to maximally utilize the value function, with the best policy extraction objective (e.g., DDPG+BC) and/or potential testtime policy improvement strategies.

For future algorithms research, we emphasize two important open questions in offline RL:
 (1) What is the best way to extract a policy from the learned value function? Is there a better way than DDPG+BC?
 (2) How can we train a policy in a way that it generalizes well on testtime states?
 The second question is particularly very interesting, because it suggests a diametrically opposed viewpoint to the prevailing theme of pessimism in offline RL, where only a few works have explicitly aimed to address this generalization aspect of offline RL!
 We hope that finding satisfying answers to these questions would lead to significant breakthroughs in offline RL, and encourage the community to incorporate a holistic picture of offline RL along the current mainstream research on value function learning.
The website template was borrowed from Michaël Gharbi and Jon Barron.