OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park
UC Berkeley
Kevin Frans
UC Berkeley
Benjamin Eysenbach
Princeton University
Sergey Levine
UC Berkeley
OGBench is a new benchmark designed to facilitate algorithms research in offline goal-conditioned RL, offline unsupervised RL, and offline RL.
Overview
Why offline goal-conditioned RL?
Introducing OGBench
Features
- 8 types of cool, realistic, diverse environments.
- 85 datasets covering various challenges in offline goal-conditioned RL, such as long-horizon reasoning, stitching, and stochastic control.
- Support for both pixel-based and state-based observations.
- Clean, well-tuned reference implementations of 6 offline goal-conditioned RL algorithms (GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL).
- Fully reproducible scripts for the entire benchmark table and datasets.
- pip-installable, easy-to-use APIs based on Gymnasium.
- No major dependencies other than MuJoCo.
Try OGBench!
# pip install ogbench
import ogbench
env, train_dataset, val_dataset = ogbench.make_env_and_datasets('humanoidmaze-large-navigate-v0')
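Continuing the snippet above, the returned datasets and environment can be used as follows. This is a minimal sketch: the dataset keys and the dictionary-of-arrays structure follow the OGBench README at the time of writing, so treat the exact names as assumptions if the API changes.
# Continuing from the snippet above (a minimal sketch; dataset keys are assumptions).
# Datasets are dictionaries of NumPy arrays; inspect the available fields and shapes.
print({k: v.shape for k, v in train_dataset.items()})  # e.g., 'observations', 'actions', ...

# The environment follows the standard Gymnasium interface.
ob, info = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # replace with a learned goal-conditioned policy
    ob, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        ob, info = env.reset()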
Locomotion Tasks
PointMaze and AntMaze
PointMaze and AntMaze are standard maze navigation tasks. PointMaze involves controlling a 2-D point mass, and AntMaze involves controlling a quadrupedal Ant agent with 8 degrees of freedom (DoF). The aim of these tasks is to control the agent to reach a goal location in the given maze. The agent must learn both high-level navigation and low-level locomotion skills, purely from diverse offline trajectories.
OGBench substantially extends the original PointMaze and AntMaze tasks proposed by D4RL. Unlike the original D4RL tasks, which only support Point and Ant agents and do not test stitching, stochasticity handling, or pixel-based control, we support Humanoid control, pixel-based observations, and multi-goal evaluation, while providing more challenging and diverse mazes and datasets, as described below.
HumanoidMaze
HumanoidMaze is a challenging maze navigation task that involves full-body control of a 21-DoF Humanoid agent. This requires substantially more complex and long-horizon reasoning than PointMaze and AntMaze. For example, the longest task requires up to 3,000 environment steps!
Maze types
We provide four types of mazes: medium, large, giant, and teleport, with different sizes and characteristics. The medium, large, and giant mazes provide increasing levels of difficulty. The teleport maze is a special maze designed to challenge the agent’s ability to handle environment stochasticity. This maze contains multiple stochastic teleporters, where a black hole immediately transports the agent to a randomly chosen white hole (scroll up a bit to see the video). However, since one of the three white holes is a dead-end, there is always a risk in taking a teleporter. Hence, the agent must learn to avoid the black holes, without being optimistically biased by "lucky" outcomes.
Dataset types
For maze navigation tasks, we provide three types of datasets: navigate, stitch, and explore, which pose different kinds of challenges. The navigate dataset is the standard dataset, collected by a noisy expert policy that randomly navigates the maze. The stitch dataset, which consists of short trajectory segments, is designed to challenge the agent’s stitching ability. The explore dataset, which consists of random exploratory trajectories, is designed to test whether the agent can learn navigation skills from extremely low-quality (yet high-coverage) data.
AntSoccer
AntSoccer is a different type of locomotion task that involves controlling an Ant agent to dribble a soccer ball. It is significantly harder than AntMaze because the agent must also carefully control the ball while navigating the environment. We provide two maze types, arena and medium, and two dataset types, navigate and stitch. The navigate dataset consists of trajectories in which the agent repeatedly approaches the ball and dribbles it to random locations. The stitch dataset consists of two types of trajectories: navigating the maze without the ball, and dribbling the ball when it starts near the agent. Stitching these two behaviors is required to complete the full task.
Visual maze navigation
AntMaze and HumanoidMaze support both state-based and pixel-based observations with \(64 \times 64 \times 3\) RGB camera images rendered from a third-person camera viewpoint. In pixel-based tasks, we color the floor to enable the agent to infer its location from the images. In these tasks, we do not provide any low-dimensional state information like joint angles; the agent must learn purely from image observations.
Manipulation Tasks
Cube
The Cube task is designed to evaluate the agent's basic object manipulation abilities. It involves pick-and-place manipulation of colored cube blocks. We provide four variants of the task: cube-single, cube-double, cube-triple, and cube-quadruple, with different numbers of cubes.
The goal of Cube is to control the robot arm to arrange cubes into designated configurations. Each environment provides five evaluation goals that require moving, stacking, swapping, or permuting cubes. The datasets are collected by a scripted policy that repeatedly picks a random block and places it at another random location or on top of another block. The agent must learn pick-and-place skills from unstructured data (which does not necessarily contain the evaluation tasks) and combine them sequentially.
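For instance, an agent can be evaluated on each of the five goals by selecting them at reset time. The following is a minimal sketch: the environment name follows the naming pattern above, and the task_id and render_goal reset options follow the OGBench README, but treat these exact names as assumptions.
# A minimal evaluation sketch (environment name and reset options are assumptions
# based on the naming pattern and README above; the random policy is a placeholder).
import ogbench

env, train_dataset, val_dataset = ogbench.make_env_and_datasets('cube-double-play-v0')

for task_id in range(1, 6):  # each environment provides five evaluation goals
    ob, info = env.reset(options=dict(task_id=task_id, render_goal=True))
    terminated = truncated = False
    while not (terminated or truncated):
        action = env.action_space.sample()  # replace with the learned goal-conditioned policy
        ob, reward, terminated, truncated, info = env.step(action)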
Scene
The Scene task is designed to challenge the sequential, long-horizon reasoning capabilities of the agent. It involves manipulating diverse types of objects (cube, window, drawer, and button). If the agent presses a button, it toggles the lock status of the corresponding object (drawer or window). The dataset is collected by a scripted policy that randomly interacts with these objects. We provide five evaluation tasks that require the agent to arrange the objects into desired configurations. They require a significant degree of sequential reasoning: for example, the longest task involves 8 atomic behaviors (see the video on the right)! Hence, the agent must be able to plan and sequentially combine learned manipulation skills.
Puzzle
The Puzzle task is designed to test the combinatorial generalization abilities of the agent. It requires solving the "Lights Out" puzzle with a robot arm (try it out!). The rule is simple: when a button is pressed, it toggles the color of the button and its neighbors. We provide four levels of difficulty, puzzle-3x3, puzzle-4x4, puzzle-4x5, and puzzle-4x6, with different grid sizes.
The goal of Puzzle is to achieve a desired configuration of colors (e.g., turning all buttons blue). Each environment provides five evaluation tasks of varying difficulty. The datasets are collected by a scripted policy that randomly presses buttons in arbitrary sequences. Given the enormous state space (with up to \(2^{24} = 16,777,216\) distinct button states), the agent must achieve combinatorial generalization while mastering low-level continuous control.
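To make the combinatorics concrete, here is a small, self-contained sketch of the underlying "Lights Out" toggle rule (the symbolic layer only; in the benchmark, the agent must additionally press the physical buttons with the robot arm under continuous control):
# A schematic sketch of the "Lights Out" rule (symbolic layer only).
import numpy as np

def press(board, row, col):
    # Toggle the pressed button and its (orthogonal) neighbors, as in the classic puzzle.
    board = board.copy()
    for r, c in [(row, col), (row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]:
        if 0 <= r < board.shape[0] and 0 <= c < board.shape[1]:
            board[r, c] ^= 1  # 0 / 1 stand for the two button colors (illustrative labels)
    return board

board = np.zeros((4, 6), dtype=np.int64)  # puzzle-4x6: 24 buttons, hence 2**24 configurations
board = press(board, 1, 2)
print(board)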
Dataset types
For each manipulation task, we provide two types of datasets: play and noisy. The play datasets are the standard datasets, collected by non-Markovian expert policies with temporally correlated noise, which feature natural, realistic trajectories. To support more diverse types of research (e.g., dataset ablation studies), we additionally provide the noisy datasets collected by Markovian expert policies with larger, uncorrelated Gaussian noise, which feature high state coverage.
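To illustrate the distinction, here is a schematic contrast between the two noise schemes (a sketch only, not the actual data-collection scripts; the Ornstein-Uhlenbeck-like form below is an illustrative choice for "temporally correlated"):
# Schematic contrast between the two noise schemes (illustrative, not the actual scripts).
import numpy as np

rng = np.random.default_rng(0)

def noisy_style_action(expert_action, sigma=0.4):
    # "noisy"-style: independent (uncorrelated) Gaussian noise at every step.
    return expert_action + sigma * rng.normal(size=expert_action.shape)

class PlayStyleNoise:
    # "play"-style: temporally correlated noise, sketched here as an
    # Ornstein-Uhlenbeck-like process.
    def __init__(self, dim, theta=0.15, sigma=0.2):
        self.noise = np.zeros(dim)
        self.theta, self.sigma = theta, sigma

    def __call__(self, expert_action):
        self.noise += -self.theta * self.noise + self.sigma * rng.normal(size=self.noise.shape)
        return expert_action + self.noise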
Visual manipulation
Every manipulation task supports both state-based and pixel-based observations with \(64 \times 64 \times 3\) RGB camera images. For pixel observations, we adjust colors and make the arm transparent to ensure full observability.
Drawing Tasks
Powderworld
Powderworld (try it out!) is a drawing task that presents unique challenges with high intrinsic dimensionality. The goal of Powderworld is to draw a target picture on a \(32 \times 32\) grid using different types of "powder" brushes, each of which corresponds to a unique element with distinct physical properties. For example, the "sand" brush falls down and piles up, and the "fire" brush burns combustible elements like "plant". We provide three versions of the task, powderworld-easy, powderworld-medium, and powderworld-hard, with different numbers of available elements. The datasets are collected by a random policy that keeps drawing arbitrary shapes with random brushes. To solve the tasks at evaluation time, the agent must achieve a high degree of generalization with a deep understanding of the physics, while handling high intrinsic dimensionality and stochasticity.
Results & Discussion
Benchmarking results
Which datasets should I use for my research?
Here is our guideline for choosing datasets. For general algorithms research in offline goal-conditioned RL, we recommend starting with more "regular" datasets, such as:
- antmaze-{large, giant}-navigate
- humanoidmaze-medium-navigate
- cube-{single, double}-play
- scene-play
- puzzle-3x3-play
From there, depending on the performance on these tasks, try harder versions of them or more challenging tasks, including:
- humanoidmaze-{large, giant}
- antsoccer
- puzzle-{4x4, 4x5, 4x6}
- powderworld
We also provide more specialized datasets that pose specific challenges in offline GCRL.
- For stitching, try the stitch datasets in the locomotion suite as well as complex manipulation tasks that require stitching (e.g., puzzle).
- For long-horizon reasoning, consider humanoidmaze-giant, which has the longest episode length, and puzzle-4x6, which has the most semantic steps.
- For stochastic control, try antmaze-teleport, which is specifically designed to challenge optimistically biased methods, and powderworld, which has unpredictable, stochastic dynamics.
- For learning from highly suboptimal data, consider antmaze-explore as well as the noisy datasets in the manipulation suite, which feature high suboptimality and high coverage.
Research opportunities
Here, we list potential research ideas and open questions in offline goal-conditioned RL that we believe are interesting and important.
Be the first to solve unsolved tasks! While all environments in OGBench have at least one variant that current methods can solve to some degree, there are still a number of challenging tasks on which no existing method achieves non-trivial performance, such as humanoidmaze-giant, cube-triple, puzzle-4x5, powderworld-hard, and more. We ensure that sufficient data is available for these tasks (the amount is estimated from what is needed to solve their easier versions). We invite researchers to take on these challenges and push the limits of offline goal-conditioned RL with better algorithms!
How can we develop a policy that generalizes well at test time? In our experiments, we found hierarchical RL methods (e.g., HIQL) to work especially well on several tasks. We hypothesize that this is mainly because hierarchical RL reduces learning complexity by having two policies specialized in different things, which helps both policies generalize better at evaluation time. After all, test-time generalization is known to be one of the major bottlenecks in offline RL. But are hierarchies really necessary to achieve good test-time generalization? Can we develop a non-hierarchical method that enjoys the same benefit by exploiting the subgoal structure of offline goal-conditioned RL? This would be especially beneficial, not just because it is simpler, but also because it could yield better, unified representations that can potentially serve as a "foundation model" for fine-tuning.
Can we develop a method that works well across all categories? Our results show that no method consistently performs best across the board. HIQL tends to achieve strong performance but struggles in pixel-based locomotion and state-based manipulation. GCIQL shows strong performance in state-based manipulation, but struggles in locomotion. CRL exhibits the opposite trend: it excels in locomotion but underperforms in manipulation. Is there a way to combine only the strengths of these methods to develop a single approach that achieves the best performance across all types of tasks?
More concrete research questions
Why is PointMaze so hard? The benchmark table shows that PointMaze is surprisingly hard, sometimes even harder than AntMaze(!) for some methods. Why is this the case? Moreover, only in PointMaze does QRL significantly outperform the other methods. What causes this difference, and are there any insights we can take from these results?
How should we train subgoal representations? Somewhat surprisingly, HIQL struggles much more with state-based observations than with pixel-based observations on manipulation tasks. We suspect this is related to subgoal representations, given that HIQL uses an additional learning signal from the policy loss to further train subgoal representations only in pixel-based environments (which we found does not help in state-based environments). HIQL uses a value function-based subgoal representation, but is there a better, more stable way to learn subgoal representations for hierarchical RL and planning?
Do we really need the full power of RL? While learning the optimal \(Q^*\) function is in principle better than learning the behavioral \(Q^\beta\) function, CRL (which fits \(Q^\beta\)) significantly outperforms GCIQL (which fits \(Q^*\)) on locomotion tasks. Why is this the case? Is it a problem with expectile regression in GCIQL or with temporal difference learning itself? In contrast, in manipulation environments, the results suggest the opposite: GCIQL is much better than CRL. Does this mean we do need \(Q^*\) in these tasks? Or can they be solved even with \(Q^\beta\) if we use a better behavioral value learning technique than the binary NCE objective in CRL?
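For reference, the two value-learning objectives being contrasted are, roughly (as we understand them; see the IQL/GCIQL and CRL papers for the exact forms):
\[ \mathcal{L}_{\text{GCIQL}} = \mathbb{E}\big[\, |\tau - \mathbb{1}(Q(s, a, g) - V(s, g) < 0)| \, (Q(s, a, g) - V(s, g))^2 \,\big] + \mathbb{E}\big[\, (r(s, g) + \gamma V(s', g) - Q(s, a, g))^2 \,\big], \]
\[ \mathcal{L}_{\text{CRL}} = -\mathbb{E}\big[\, \log \sigma(f(s, a, g^+)) + \log(1 - \sigma(f(s, a, g^-))) \,\big], \]
where \(g^+\) is a future state of \((s, a)\) in the dataset, \(g^-\) is a randomly sampled state, and the learned critic \(f\) is monotonically related to \(Q^\beta\).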
Why can't we use random goals when training policies? When training goal-conditioned policies, we found that it is usually better to sample (policy) goals only from future states in the current trajectory (except on stitch or explore datasets). The fact that this works better even in Scene and Puzzle (which require goal stitching) is a bit surprising, because it means that the policy can still perform goal stitching to some degree even without being explicitly trained on the test-time state-goal pairs. At the same time, it is rather unsatisfying, because this ability to stitch goals entirely depends on the seemingly "magical"🧙 generalization capabilities of neural networks. Is there a way to train a goal-conditioned policy with random goals while maintaining performance, so that it can perform goal stitching in a principled manner?
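To make the two strategies concrete, here is a schematic sketch of future-state vs. random goal sampling for policy training (a hypothetical helper, not the sampler used in the reference implementations; the geometric offset is an illustrative convention):
# Schematic goal sampler for policy training (hypothetical helper, illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sample_policy_goal(dataset, idx, ep_end, p_random=0.0, discount=0.99):
    # With probability `p_random`, sample a random state from the whole dataset
    # ("random goals"); otherwise sample a future state from the same trajectory,
    # with a geometric(1 - discount) offset ("future goals").
    observations = dataset['observations']
    if rng.random() < p_random:
        return observations[rng.integers(len(observations))]
    offset = rng.geometric(1 - discount)
    return observations[min(idx + offset, ep_end - 1)]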
How can we combine expressive policies with goal-conditioned RL? If we compare the results on cube-single-play and cube-single-noisy, we can see that current offline goal-conditioned RL methods often struggle with datasets collected by non-Markovian policies in manipulation environments. Handling non-Markovian trajectory data is indeed one of the major challenges in behavioral cloning, for which many recent behavioral cloning-based methods have been proposed (e.g., ACT and Diffusion Policy). Can we incorporate these recent advancements in behavioral cloning into offline goal-conditioned RL?
We believe OGBench provides a foundation for answering these research questions, and we hope it leads to performant, scalable offline GCRL algorithms that enable building RL foundation models!
Citation
@article{ogbench_park2024,
title={OGBench: Benchmarking Offline Goal-Conditioned RL},
author={Seohong Park and Kevin Frans and Benjamin Eysenbach and Sergey Levine},
journal={ArXiv},
year={2024}
}
The website template was partly borrowed from Michaël Gharbi and Jon Barron.