Foundation Policies with Hilbert Representations

Seohong Park
UC Berkeley 
Tobias Kreiman
UC Berkeley 
Sergey Levine
UC Berkeley
Question
 Imagine we have a bunch of unlabeled trajectories, such as human demonstrations, task-agnostic robotic datasets, or records from previously deployed agents.
 How should we pretrain a generalist policy ("foundation policy") from such an unlabeled dataset? Which objective should we use?
 Perhaps we can use behavioral cloning (BC) or offline goal-conditioned reinforcement learning (GCRL) to pretrain policies. However, BC requires expert demonstrations and GCRL can only learn goal-conditioned policies. Is there a better way than these two?
The HILP Framework
 In this work, we propose a novel unsupervised policy pretraining scheme that captures diverse, optimal, long-horizon behaviors from unlabeled data. These behaviors are learned in such a way that they can be quickly adapted to various downstream tasks in a zero-shot manner.
 Our method consists of two components: Hilbert representations and Hilbert foundation policies (HILPs).
Hilbert Representations
 Our idea starts from learning a distance-preserving representation, which we call a Hilbert representation, from offline data.
 Specifically, we train a representation \(\phi : \mathcal{S} \to \mathcal{Z}\) that maps the state space \(\mathcal{S}\) into a Hilbert space \(\mathcal{Z}\) (i.e., a metric space with a well-defined inner product) such that $$\begin{aligned} d^*(s, g) = \|\phi(s) - \phi(g)\| \end{aligned}$$ holds for every \(s, g \in \mathcal{S}\), where \(d^*\) denotes the temporal distance (i.e., the minimum number of steps).
 This can be viewed as a temporal distance-based abstraction of the state space, where temporally similar states are mapped to nearby points in the latent space. This way, we can abstract the dataset states while preserving their long-horizon global relationships.
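 As a rough illustration of the distance-preserving property, the sketch below checks it on a toy problem. Everything here is an assumption made for the example: a hypothetical five-state chain in which each state can step to its neighbors, a hand-picked embedding \(\phi(i) = [i]\) in place of a learned network, and the helper name `temporal_distances`. It only shows what the condition \(d^*(s, g) = \|\phi(s) - \phi(g)\|\) means, not how \(\phi\) is trained.

```python
from collections import deque

import numpy as np


def temporal_distances(adjacency):
    """All-pairs minimum step counts, computed with BFS on an unweighted graph."""
    n = len(adjacency)
    dist = np.full((n, n), np.inf)
    for start in range(n):
        dist[start, start] = 0.0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start, v] == np.inf:  # not visited yet
                    dist[start, v] = dist[start, u] + 1.0
                    queue.append(v)
    return dist


# Hypothetical 5-state chain: state i can move to i - 1 or i + 1.
n_states = 5
adjacency = [
    [j for j in (i - 1, i + 1) if 0 <= j < n_states] for i in range(n_states)
]
d_star = temporal_distances(adjacency)

# Hand-picked embedding phi(i) = [i]; a learned phi would replace this array.
phi = np.arange(n_states, dtype=float).reshape(-1, 1)
latent_dist = np.linalg.norm(phi[:, None, :] - phi[None, :, :], axis=-1)

# Largest gap between temporal distance and latent distance over all pairs.
max_error = float(np.abs(d_star - latent_dist).max())
print(max_error)
```

 On this chain the embedding is exact, so the gap is zero. In richer environments an exact embedding may not exist, and a learned \(\phi\) can only approximate the condition.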
Hilbert Foundation Policies (HILPs)
 We then train a latent-conditioned policy \(\pi(a \mid s, z)\) that spans the learned latent space with directional movements. We use the following intrinsic reward based on the inner product to train the policy: $$\begin{aligned} r(s, z, s') = \langle \phi(s') - \phi(s), z \rangle. \end{aligned}$$
 Intuitively, by learning to move in every possible direction specified by a unit vector \(z \in \mathcal{Z}\), the policy learns diverse long-horizon behaviors that optimally span the latent space as well as the state space.
 We call the resulting policy \(\pi(a \mid s, z)\) a Hilbert foundation policy (HILP) for the versatility we describe below.
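 To make the intrinsic reward concrete, here is a minimal sketch that evaluates the inner product of the latent displacement \(\phi(s') - \phi(s)\) with \(z\) along a short trajectory. The function name `hilp_reward` and the random latents standing in for \(\phi(s_0), \dots, \phi(s_5)\) are assumptions for illustration only. The check at the end reflects a simple consequence of the reward's form: the undiscounted sum of rewards telescopes to the total displacement along \(z\).

```python
import numpy as np


def hilp_reward(phi_s, phi_next, z):
    """Inner product of the latent displacement phi(s') - phi(s) with z."""
    return float(np.dot(phi_next - phi_s, z))


rng = np.random.default_rng(0)

# Stand-in latents for phi(s_0), ..., phi(s_5); a trained phi would supply these.
latents = rng.normal(size=(6, 3))

# A unit-norm direction z in the latent space.
z = rng.normal(size=3)
z /= np.linalg.norm(z)

rewards = [hilp_reward(latents[t], latents[t + 1], z) for t in range(5)]
total_reward = sum(rewards)

# The per-step rewards telescope to the end-to-end displacement along z.
displacement = float(np.dot(latents[-1] - latents[0], z))
print(total_reward, displacement)
```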
Why are HILPs useful?
 HILPs have a number of appealing properties.
 First, HILPs capture a variety of diverse behaviors, or skills, from offline data. These behaviors can be hierarchically combined or fine-tuned to solve downstream tasks efficiently.
 Second, behaviors captured by HILPs are provably optimal for solving goal-reaching tasks, so our method subsumes goal-conditioned RL as a special case while providing much more diverse behaviors.
 Third, the linear structure of the HILP reward enables zero-shot RL: at test time, we can immediately find the best latent vector \(z\) that solves a given task simply by linear regression.
 Fourth, the HILP framework yields a highly structured Hilbert representation \(\phi\), which enables efficient test-time planning without any additional training.
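 The zero-shot property above (finding \(z\) by linear regression) can be sketched as follows. This is one plausible reading, assuming we have a small batch of reward-labeled transitions with precomputed embeddings and fit \(z\) by ordinary least squares so that \(\langle \phi(s') - \phi(s), z \rangle\) matches the task reward. The function name `infer_task_latent`, the synthetic data, and the omission of any normalization of \(z\) are assumptions of this sketch, not details taken from the text.

```python
import numpy as np


def infer_task_latent(phi_s, phi_next, rewards):
    """Least-squares fit of z so that <phi(s') - phi(s), z> approximates r."""
    features = phi_next - phi_s  # shape: (num_transitions, latent_dim)
    z, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return z


rng = np.random.default_rng(0)

# Synthetic stand-ins for embeddings of 256 transitions in a 4-D latent space.
phi_s = rng.normal(size=(256, 4))
phi_next = phi_s + 0.1 * rng.normal(size=(256, 4))

# Hypothetical task whose reward is exactly linear in the latent displacement.
z_true = np.array([1.0, -0.5, 0.0, 2.0])
rewards = (phi_next - phi_s) @ z_true

z_hat = infer_task_latent(phi_s, phi_next, rewards)
print(np.round(z_hat, 3))
```

 With noiseless, exactly linear rewards the fit recovers `z_true`. For a real task reward the fit is only the best linear approximation, and the recovered vector would then be passed to the pretrained policy \(\pi(a \mid s, z)\).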
Experiments (Zero-Shot RL)
Environments
 For zero-shot RL experiments, we use four DM Control environments and ExORL datasets (Yarats et al., 2022) collected by unsupervised RL agents. Each environment provides four test tasks (e.g., Flip, Run, Stand, and Walk for Walker), which the agent must solve at test time without any additional training.
Results
 HILPs achieve the best overall zero-shot RL performance, outperforming previous successor feature and goal-conditioned RL methods.
 Even in pixel-based ExORL benchmarks, HILPs achieve the best performance, outperforming the two strongest prior approaches.
Experiments (Zero-Shot Goal-Conditioned RL)
Environments
 For zero-shot goal-conditioned RL, we use three challenging long-horizon environments from the D4RL benchmark suite (Fu et al., 2020). We also employ a pixel-based version of the Kitchen environment. The agent must reach a goal state from a given start state, where the goal is specified at test time.
Results
 HILPs significantly outperform previous general unsupervised policy pretraining methods, and achieve performance comparable to methods that are specifically designed for goal-conditioned RL.
 Moreover, with our efficient test-time planning procedure based on Hilbert representations, HILPs often even outperform offline goal-conditioned RL methods.
Visualization of Hilbert Representations
 We visualize Hilbert representations learned on the antmaze-large-diverse dataset.
 Since Hilbert representations are learned to capture the temporal structure of the MDP, they focus on the global layout of the maze even when we use a two-dimensional latent space, and accurately capture the maze layout with a 32-dimensional latent space.
 Thanks to the Hilbert structure, we can reach the goal state by simply moving in the goal direction in the latent space.
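 Moving in the goal direction can be sketched as choosing the unit vector that points from \(\phi(s)\) to \(\phi(g)\) and feeding it to the policy as \(z\). The helper name `goal_direction`, the small `eps` guard against division by zero, and the two-dimensional latents below are assumptions made for illustration.

```python
import numpy as np


def goal_direction(phi_s, phi_g, eps=1e-8):
    """Unit vector in the latent space pointing from phi(s) toward phi(g)."""
    diff = phi_g - phi_s
    return diff / (np.linalg.norm(diff) + eps)


# Hypothetical 2-D latents for the current state and the goal.
phi_s = np.array([0.0, 0.0])
phi_g = np.array([3.0, 4.0])

z = goal_direction(phi_s, phi_g)

# A latent step toward the goal earns positive reward <phi(s') - phi(s), z>.
phi_next = phi_s + 0.5 * z
reward = float(np.dot(phi_next - phi_s, z))
print(z, reward)
```

 In a control loop, \(z\) would be recomputed from the current state at every step, so the commanded direction keeps pointing at the goal as the agent moves.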
The website template was borrowed from Michaël Gharbi and Jon Barron.