HIQL: Offline Goal-Conditioned RL with Latent States as Actions
NeurIPS 2023

Seohong Park
UC Berkeley 
Dibya Ghosh
UC Berkeley 
Benjamin Eysenbach
Carnegie Mellon University 
Sergey Levine
UC Berkeley
Abstract
Unsupervised pretraining has recently become the bedrock for computer vision and natural language processing. In reinforcement learning (RL), goal-conditioned RL can potentially provide an analogous self-supervised approach for making use of large quantities of unlabeled (reward-free) data. However, building effective algorithms for goal-conditioned RL that can learn directly from diverse offline data is challenging, because it is hard to accurately estimate the exact value function for faraway goals. Nonetheless, goal-reaching problems exhibit structure, such that reaching distant goals entails first passing through closer subgoals. This structure can be very useful, as assessing the quality of actions for nearby goals is typically easier than for more distant goals. Based on this idea, we propose a hierarchical algorithm for goal-conditioned RL from offline data. Using one action-free value function, we learn two policies that allow us to exploit this structure: a high-level policy that treats states as actions and predicts (a latent representation of) a subgoal and a low-level policy that predicts the action for reaching this subgoal. Through analysis and didactic examples, we show how this hierarchical decomposition makes our method robust to noise in the estimated value function. We then apply our method to offline goal-reaching benchmarks, showing that our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data.
Why is Offline Goal-Conditioned RL Hard?
 Offline goal-conditioned RL is challenging because the "signal-to-noise" ratio of the learned value function can be very low for faraway goals.
 As illustrated in the figure above, as a state gets farther from the goal, the policy becomes more erroneous due to the noise in the learned value function.
Solution: Hierarchical Policy Extraction
 To address this issue, we extract hierarchical policies from the same value function.
 The high-level policy \(\pi^h(s_{t+k} \mid s_t, g)\), which treats states as actions, produces an intermediate subgoal \(s_{t+k}\) to reach the goal \(g\).
 The low-level policy \(\pi^\ell(a_t \mid s_t, s_{t+k})\) produces actions to reach the subgoal.
 This hierarchical structure improves the "signal-to-noise" ratio because the relative differences between the values being compared are larger, even though both policies are extracted from the same value function.
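
The signal-to-noise argument above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the experiment from the paper: the 1-D chain environment, the true value \(V^*(s, g) = -|s - g|\), the multiplicative Gaussian noise model, and all function names are hypothetical choices made for illustration.

```python
import random


def true_value(s, g):
    # Assumed toy setting: states are integers on a line, value is negative distance.
    return -abs(s - g)


def noisy_value(s, g, sigma, rng):
    # Assumed noise model: the error scales with the magnitude of the true value,
    # so estimates for faraway goals are noisier in absolute terms.
    return true_value(s, g) * (1.0 + sigma * rng.gauss(0.0, 1.0))


def flat_step(s, g, sigma, rng):
    # Flat policy: compare the noisy values of the two neighboring states.
    left = noisy_value(s - 1, g, sigma, rng)
    right = noisy_value(s + 1, g, sigma, rng)
    return 1 if right > left else -1


def hierarchical_step(s, g, k, sigma, rng):
    # High level: compare candidate subgoals k steps away (large value gap).
    sub_left = noisy_value(s - k, g, sigma, rng)
    sub_right = noisy_value(s + k, g, sigma, rng)
    subgoal = s + k if sub_right > sub_left else s - k
    # Low level: act toward the nearby subgoal (small absolute noise).
    return flat_step(s, subgoal, sigma, rng)


def error_rate(step_fn, trials, seed=0):
    # Fraction of trials in which the chosen step moves away from the goal.
    rng = random.Random(seed)
    wrong = sum(step_fn(rng) != 1 for _ in range(trials))
    return wrong / trials


# Start at state 0 with the goal at state 50, so the correct step is +1.
flat_err = error_rate(lambda rng: flat_step(0, 50, 0.1, rng), 20000)
hier_err = error_rate(lambda rng: hierarchical_step(0, 50, 10, 0.1, rng), 20000)
print(f"flat error rate: {flat_err:.3f}, hierarchical error rate: {hier_err:.3f}")
```

Both policies read from the same noisy value function. In this toy setting the hierarchical variant picks the wrong direction much less often, because each of its two comparisons has a larger value gap relative to the noise.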
Hierarchical Implicit Q-Learning (HIQL)
HIQL Objective
 Based on the idea above, our method, HIQL, trains a single goal-conditioned value function with IQL, from which we extract two policies with AWR.
 The value function minimizes the IQL loss: \(\mathbb{E}[L_2^\tau (r(s, g) + \gamma \bar{V}(s', g) - V(s, g))]\), where \(L_2^\tau\) is the expectile loss.
 The high-level policy maximizes the high-level AWR objective: \(\mathbb{E}[\log \pi^h(s_{t+k} \mid s_t, g) e^{\beta (V(s_{t+k}, g) - V(s_t, g))}]\).
 The low-level policy maximizes the low-level AWR objective: \(\mathbb{E}[\log \pi^\ell(a_t \mid s_t, s_{t+k}) e^{\beta (V(s_{t+1}, s_{t+k}) - V(s_t, s_{t+k}))}]\).
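
The three objectives above can be sketched per sample in plain Python. This is an illustrative sketch, not the authors' implementation: the expectile loss is written in its standard form \(L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\,u^2\), and the function names, the default values of `tau`, `gamma`, and `beta`, and the toy value function are assumptions.

```python
import math


def expectile_loss(u, tau):
    # Asymmetric squared loss: weight tau for u >= 0 and (1 - tau) for u < 0.
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def value_loss(s, s_next, g, r, v, v_target, tau=0.7, gamma=0.99):
    # Expectile loss of the TD error r + gamma * V_target(s', g) - V(s, g).
    # tau and gamma defaults are assumed values for illustration.
    td_error = r + gamma * v_target(s_next, g) - v(s, g)
    return expectile_loss(td_error, tau)


def high_level_awr_weight(v, s_t, s_tk, g, beta=1.0):
    # Weight multiplying log pi^h(s_{t+k} | s_t, g): progress toward the goal g.
    return math.exp(beta * (v(s_tk, g) - v(s_t, g)))


def low_level_awr_weight(v, s_t, s_t1, s_tk, beta=1.0):
    # Weight multiplying log pi^l(a_t | s_t, s_{t+k}): progress toward the subgoal.
    return math.exp(beta * (v(s_t1, s_tk) - v(s_t, s_tk)))


def toy_v(s, g):
    # Hypothetical value function on a 1-D line, used only to exercise the code.
    return -abs(s - g)


# A subgoal 3 steps closer to the goal is up-weighted for the high-level policy.
w_high = high_level_awr_weight(toy_v, s_t=0, s_tk=3, g=10)
# A step toward the subgoal is up-weighted, a step away is down-weighted.
w_low_toward = low_level_awr_weight(toy_v, s_t=0, s_t1=1, s_tk=3)
w_low_away = low_level_awr_weight(toy_v, s_t=0, s_t1=-1, s_tk=3)
```

Note that neither AWR weight depends on the action itself, only on states, which is why a single action-free value function suffices for both policies.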
Representations for Subgoals
 Since directly predicting high-dimensional subgoals is challenging, our high-level policy produces low-dimensional representations of subgoals.
 To do this, HIQL parameterizes the value function as \(V(s, \phi(g))\) and simply uses \(\phi(g)\) as the goal representation. We prove that this simple value-based representation does not lose any information about the goal for the policy.
 Our high-level policy hence produces \(z_{t+k} = \phi(s_{t+k})\) instead of \(s_{t+k}\).
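
To make the data flow concrete, the following sketch shows how an action could be selected once the high-level policy outputs a latent subgoal. The function `act` and the stub policies are hypothetical and only mirror the notation above; how often the high-level policy is re-queried is not specified here and is left to the caller.

```python
def act(s, g, high_policy, low_policy):
    # High level: predict a latent subgoal z_{t+k}, standing in for phi(s_{t+k}).
    z_subgoal = high_policy(s, g)
    # Low level: predict an action conditioned on the state and the latent subgoal.
    return low_policy(s, z_subgoal)


# Hypothetical stubs on a 1-D line, used only to exercise the control flow.
def phi(state):
    # Stand-in for the learned representation (here a fixed scaling).
    return 0.5 * state


def stub_high_policy(s, g, k=3):
    # Pretend the subgoal is k steps toward the goal and return its representation.
    target = min(s + k, g) if g >= s else max(s - k, g)
    return phi(target)


def stub_low_policy(s, z_subgoal):
    # Step toward whichever side the latent subgoal lies on.
    return 1 if z_subgoal > phi(s) else -1


action = act(0, 10, stub_high_policy, stub_low_policy)
```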
Leveraging Action-Free Data
 One important property of HIQL is that only the low-level policy requires action labels.
 As such, we can leverage a potentially large amount of video data when (pre)training the value function and high-level policy.
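
One way to picture this property is as a routing of transitions to training objectives. The sketch below is an assumption about how such a split might be organized, not the authors' data pipeline; the dictionary keys `s`, `a`, and `s_next` are hypothetical.

```python
def split_by_action_labels(transitions):
    """Route transitions to the objectives that can use them.

    Each transition is a dict with keys 's' and 's_next', plus an optional
    'a' that is missing or None for action-free (e.g. video) data.
    """
    # The value function and high-level policy only need state sequences.
    state_only = [(t["s"], t["s_next"]) for t in transitions]
    # The low-level policy additionally needs the action label.
    with_actions = [
        (t["s"], t["a"], t["s_next"])
        for t in transitions
        if t.get("a") is not None
    ]
    return state_only, with_actions


data = [
    {"s": 0, "a": 1, "s_next": 1},   # action-labeled transition
    {"s": 1, "s_next": 2},           # action-free transition (no 'a' key)
    {"s": 2, "a": None, "s_next": 3},
]
state_only, with_actions = split_by_action_labels(data)
```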
Environments
Results
 HIQL achieves the best performance on most of the state-based and pixel-based benchmarks.
AntMaze Videos
Subgoal visualization
Visual AntMaze
Procgen Maze Videos
 To visualize subgoals in the latent representation space, we find the maze positions that have the closest representations to the outputs of the high-level policy.
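
The lookup described above amounts to a nearest-neighbor search in representation space. The following is a minimal sketch assuming Euclidean distance and representations given as tuples of floats; the distance metric, the function name, and the toy `phi` are assumptions, not details from the source.

```python
def closest_position(z, positions, phi):
    # Return the candidate position whose representation phi(p) is nearest to z
    # under squared Euclidean distance (an assumed metric).
    def squared_distance(p):
        return sum((a - b) ** 2 for a, b in zip(phi(p), z))

    return min(positions, key=squared_distance)


def toy_phi(position):
    # Hypothetical 2-D representation of an (x, y) maze position.
    x, y = position
    return (0.1 * x, 0.1 * y)


candidates = [(0, 0), (5, 0), (5, 5), (10, 5)]
# Pretend the high-level policy output this latent subgoal.
predicted_z = (0.52, 0.46)
nearest = closest_position(predicted_z, candidates, toy_phi)
```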
The website template was borrowed from Michaël Gharbi and Jon Barron.