OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park
UC Berkeley
Kevin Frans
UC Berkeley
Benjamin Eysenbach
Princeton University
Sergey Levine
UC Berkeley
OGBench is a new benchmark designed to facilitate algorithms research in offline goal-conditioned RL, offline unsupervised RL, and offline RL.
Overview
Why offline goal-conditioned RL?
Introducing OGBench
Features
- 8 types of cool, realistic, diverse environments.
- 85 datasets covering various challenges in offline goal-conditioned RL, such as long-horizon reasoning, stitching, and stochastic control.
- Support for both pixel-based and state-based observations.
- Clean, well-tuned reference implementations of 6 offline goal-conditioned RL algorithms (GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL).
- Fully reproducible scripts for the entire benchmark table and datasets.
- pip-installable, easy-to-use APIs based on Gymnasium.
- No major dependencies other than MuJoCo.
Try OGBench!
# pip install ogbench
import ogbench
env, train_dataset, val_dataset = ogbench.make_env_and_datasets('humanoidmaze-large-navigate-v0')
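Continuing the snippet above, the returned datasets and environment can be used as follows. This is a minimal sketch: the dataset keys and the dictionary-of-arrays structure follow the OGBench README at the time of writing, so treat the exact names as assumptions if the API changes.
# Continuing from the snippet above (a minimal sketch; dataset keys are assumptions).
# Datasets are dictionaries of NumPy arrays; inspect the available fields and shapes.
print({k: v.shape for k, v in train_dataset.items()})  # e.g., 'observations', 'actions', ...

# The environment follows the standard Gymnasium interface.
ob, info = env.reset()
for _ in range(10):
    action = env.action_space.sample()  # replace with a learned goal-conditioned policy
    ob, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        ob, info = env.reset()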
Locomotion Tasks
PointMaze and AntMaze
PointMaze and AntMaze are standard maze navigation tasks. PointMaze involves controlling a 2-D point mass, and AntMaze involves controlling a quadrupedal Ant agent with 8 degrees of freedom (DoF). The aim of these tasks is to control the agent to reach a goal location in the given maze. The agent must learn both high-level navigation and low-level locomotion skills, purely from diverse offline trajectories.
OGBench substantially extends the original PointMaze and AntMaze tasks proposed by D4RL. Unlike the original D4RL tasks, which only support Point and Ant agents and do not test stitching, stochasticity handling, or pixel-based control, we support Humanoid control, pixel-based observations, and multi-goal evaluation, while providing more challenging and diverse mazes and datasets, as described below.
HumanoidMaze
HumanoidMaze is a challenging maze navigation task that involves full-body control of a 21-DoF Humanoid agent. This requires substantially more complex and long-horizon reasoning than PointMaze and AntMaze. For example, the longest task requires up to 3,000 environment steps!
Maze types
We provide four types of mazes: medium, large, giant, and teleport, with different sizes and characteristics. The medium, large, and giant mazes provide increasing levels of difficulty. The teleport maze is a special maze designed to challenge the agent’s ability to handle environment stochasticity. This maze contains multiple stochastic teleporters, where a black hole immediately transports the agent to a randomly chosen white hole (scroll up a bit to see the video). However, since one of the three white holes is a dead-end, there is always a risk in taking a teleporter. Hence, the agent must learn to avoid the black holes, without being optimistically biased by "lucky" outcomes.
Dataset types
For maze navigation tasks, we provide three types of datasets: navigate, stitch, and explore, which pose different kinds of challenges. The navigate dataset is the standard dataset, collected by a noisy expert policy that randomly navigates the maze. The stitch dataset, which consists of short trajectory segments, is designed to challenge the agent’s stitching ability. The explore dataset, which consists of random exploratory trajectories, is designed to test whether the agent can learn navigation skills from extremely low-quality (yet high-coverage) data.
AntSoccer
AntSoccer is a different type of locomotion task that involves controlling an Ant agent to dribble a soccer ball. It is significantly harder than AntMaze because the agent must also carefully control the ball while navigating the environment. We provide two maze types, arena and medium, and two dataset types, navigate and stitch. The navigate dataset consists of trajectories in which the agent repeatedly approaches the ball and dribbles it to random locations. The stitch dataset consists of two types of trajectories: navigating the maze without the ball, and dribbling the ball when it starts near the agent. Stitching these two behaviors is required to complete the full task.
Visual maze navigation
AntMaze and HumanoidMaze support both state-based and pixel-based observations with \(64 \times 64 \times 3\) RGB camera images rendered from a third-person camera viewpoint. In pixel-based tasks, we color the floor to enable the agent to infer its location from the images. In these tasks, we do not provide any low-dimensional state information like joint angles; the agent must learn purely from image observations.
Manipulation Tasks
Cube
The Cube task is designed to evaluate the agent's basic object manipulation abilities. It involves pick-and-place manipulation of colored cube blocks. We provide four variants of the task: cube-single, cube-double, cube-triple, and cube-quadruple, with different numbers of cubes.
The goal of Cube is to control the robot arm to arrange cubes into designated configurations. Each environment provides five evaluation goals that require moving, stacking, swapping, or permuting cubes. The datasets are collected by a scripted policy that repeatedly picks a random block and places it at another random location or on top of another block. The agent must learn pick-and-place skills from unstructured data (which does not necessarily contain the evaluation tasks) and combine them sequentially.
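For instance, an agent can be evaluated on each of the five goals by selecting them at reset time. The following is a minimal sketch: the environment name follows the naming pattern above, and the task_id and render_goal reset options follow the OGBench README, but treat these exact names as assumptions.
# A minimal evaluation sketch (environment name and reset options are assumptions
# based on the naming pattern and README above; the random policy is a placeholder).
import ogbench

env, train_dataset, val_dataset = ogbench.make_env_and_datasets('cube-double-play-v0')

for task_id in range(1, 6):  # each environment provides five evaluation goals
    ob, info = env.reset(options=dict(task_id=task_id, render_goal=True))
    terminated = truncated = False
    while not (terminated or truncated):
        action = env.action_space.sample()  # replace with the learned goal-conditioned policy
        ob, reward, terminated, truncated, info = env.step(action)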
Scene
The Scene task is designed to challenge the sequential, long-horizon reasoning capabilities of the agent. It involves manipulating diverse types of objects (cube, window, drawer, and button). If the agent presses a button, it toggles the lock status of the corresponding object (drawer or window). The dataset is collected by a scripted policy that randomly interacts with these objects. We provide five evaluation tasks that require the agent to arrange the objects into desired configurations. They require a significant degree of sequential reasoning: for example, the longest task involves 8 atomic behaviors (see the video on the right)! Hence, the agent must be able to plan and sequentially combine learned manipulation skills.
Puzzle
The Puzzle task is designed to test the combinatorial generalization abilities of the agent. It requires solving the "Lights Out" puzzle with a robot arm (try it out!). The rule is simple: when a button is pressed, it toggles the color of the button and its neighbors. We provide four levels of difficulty, puzzle-3x3, puzzle-4x4, puzzle-4x5, and puzzle-4x6, with different grid sizes.
The goal of Puzzle is to achieve a desired configuration of colors (e.g., turning all buttons blue). Each environment provides five evaluation tasks of varying difficulty. The datasets are collected by a scripted policy that randomly presses buttons in arbitrary sequences. Given the enormous state space (with up to \(2^{24} = 16,777,216\) distinct button states), the agent must achieve combinatorial generalization while mastering low-level continuous control.
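To make the combinatorics concrete, here is a small, self-contained sketch of the underlying "Lights Out" toggle rule (the symbolic layer only; in the benchmark, the agent must additionally press the physical buttons with the robot arm under continuous control):
# A schematic sketch of the "Lights Out" rule (symbolic layer only).
import numpy as np

def press(board, row, col):
    # Toggle the pressed button and its (orthogonal) neighbors, as in the classic puzzle.
    board = board.copy()
    for r, c in [(row, col), (row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]:
        if 0 <= r < board.shape[0] and 0 <= c < board.shape[1]:
            board[r, c] ^= 1  # 0 / 1 stand for the two button colors (illustrative labels)
    return board

board = np.zeros((4, 6), dtype=np.int64)  # puzzle-4x6: 24 buttons, hence 2**24 configurations
board = press(board, 1, 2)
print(board)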
Dataset types
For each manipulation task, we provide two types of datasets: play and noisy. The play datasets are the standard datasets, collected by non-Markovian expert policies with temporally correlated noise, which feature natural, realistic trajectories. To support more diverse types of research (e.g., dataset ablation studies), we additionally provide the noisy datasets collected by Markovian expert policies with larger, uncorrelated Gaussian noise, which feature high state coverage.
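To illustrate the distinction, here is a schematic contrast between the two noise schemes (a sketch only, not the actual data-collection scripts; the Ornstein-Uhlenbeck-like form below is an illustrative choice for "temporally correlated"):
# Schematic contrast between the two noise schemes (illustrative, not the actual scripts).
import numpy as np

rng = np.random.default_rng(0)

def noisy_style_action(expert_action, sigma=0.4):
    # "noisy"-style: independent (uncorrelated) Gaussian noise at every step.
    return expert_action + sigma * rng.normal(size=expert_action.shape)

class PlayStyleNoise:
    # "play"-style: temporally correlated noise, sketched here as an
    # Ornstein-Uhlenbeck-like process.
    def __init__(self, dim, theta=0.15, sigma=0.2):
        self.noise = np.zeros(dim)
        self.theta, self.sigma = theta, sigma

    def __call__(self, expert_action):
        self.noise += -self.theta * self.noise + self.sigma * rng.normal(size=self.noise.shape)
        return expert_action + self.noise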
Visual manipulation
Every manipulation task supports both state-based and pixel-based observations with \(64 \times 64 \times 3\) RGB camera images. For pixel observations, we adjust colors and make the arm transparent to ensure full observability.
Drawing Tasks
Powderworld
Powderworld (try it out!) is a drawing task that presents unique challenges with high intrinsic dimensionality. The goal of Powderworld is to draw a target picture on a \(32 \times 32\) grid using different types of "powder" brushes, each of which corresponds to a unique element with distinct physical properties. For example, the "sand" brush falls down and piles up, and the "fire" brush burns combustible elements like "plant". We provide three versions of the task, powderworld-easy, powderworld-medium, and powderworld-hard, with different numbers of available elements. The datasets are collected by a random policy that keeps drawing arbitrary shapes with random brushes. To solve the tasks at evaluation time, the agent must achieve a high degree of generalization with a deep understanding of the physics, while handling high intrinsic dimensionality and stochasticity.
Results & Discussion
Benchmarking results
Which datasets should I use for my research?
Here is our guideline for choosing datasets. For general algorithms research in offline goal-conditioned RL, we recommend starting with more "regular" datasets, such as:
- antmaze-{large, giant}-navigate
- humanoidmaze-medium-navigate
- cube-{single, double}-play
- scene-play
- puzzle-3x3-play
From there, depending on the performance on these tasks, try harder versions of them or more challenging tasks, including:
- humanoidmaze-{large, giant}
- antsoccer
- puzzle-{4x4, 4x5, 4x6}
- powderworld
We also provide more specialized datasets that pose specific challenges in offline GCRL.
- For stitching, try the stitch datasets in the locomotion suite as well as complex manipulation tasks that require stitching (e.g., puzzle).
- For long-horizon reasoning, consider humanoidmaze-giant, which has the longest episode length, and puzzle-4x6, which has the most semantic steps.
- For stochastic control, try antmaze-teleport, which is specifically designed to challenge optimistically biased methods, and powderworld, which has unpredictable, stochastic dynamics.
- For learning from highly suboptimal data, consider antmaze-explore as well as the noisy datasets in the manipulation suite, which feature high suboptimality and high coverage.
Research opportunities
Here, we list potential research ideas and open questions in offline goal-conditioned RL that we believe are interesting and important.
Be the first to solve unsolved tasks! While all environments in OGBench have at least one variant that current methods can solve to some degree, there are still a number of challenging tasks on which no existing method achieves non-trivial performance, such as humanoidmaze-giant, cube-triple, puzzle-4x5, powderworld-hard, and more. We ensure that sufficient data is available for these tasks (the amount is estimated from what is needed to solve their easier versions). We invite researchers to take on these challenges and push the limits of offline goal-conditioned RL with better algorithms!
How can we develop a policy that generalizes well at test time? In our experiments, we found hierarchical RL methods (e.g., HIQL) to work especially well on several tasks. We hypothesize that this is mainly because hierarchical RL reduces learning complexity by having two policies specialized in different things, which helps both policies generalize better at evaluation time. After all, test-time generalization is known to be one of the major bottlenecks in offline RL. But are hierarchies really necessary to achieve good test-time generalization? Can we develop a non-hierarchical method that enjoys the same benefit by exploiting the subgoal structure of offline goal-conditioned RL? This would be especially beneficial, not just because it is simpler, but also because it could yield better, unified representations that can potentially serve as a "foundation model" for fine-tuning.
Can we develop a method that works well across all categories? Our results show that no method consistently performs best across the board. HIQL tends to achieve strong performance but struggles in pixel-based locomotion and state-based manipulation. GCIQL shows strong performance in state-based manipulation, but struggles in locomotion. CRL exhibits the opposite trend: it excels in locomotion but underperforms in manipulation. Is there a way to combine only the strengths of these methods to develop a single approach that achieves the best performance across all types of tasks?
More concrete research questions
Why is PointMaze so hard? The benchmark table shows that PointMaze is surprisingly hard, sometimes even harder than AntMaze(!) for some methods. Why is this the case? Moreover, only in PointMaze does QRL significantly outperform the other methods. What causes this difference, and are there any insights we can take from these results?
How should we train subgoal representations? Somewhat surprisingly, HIQL struggles much more with state-based observations than with pixel-based observations on manipulation tasks. We suspect this is related to subgoal representations, given that HIQL uses an additional learning signal from the policy loss to further train subgoal representations only in pixel-based environments (which we found does not help in state-based environments). HIQL uses a value function-based subgoal representation, but is there a better, more stable way to learn subgoal representations for hierarchical RL and planning?
Do we really need the full power of RL? While learning the optimal \(Q^*\) function is in principle better than learning the behavioral \(Q^\beta\) function, CRL (which fits \(Q^\beta\)) significantly outperforms GCIQL (which fits \(Q^*\)) on locomotion tasks. Why is this the case? Is it a problem with expectile regression in GCIQL or with temporal difference learning itself? In contrast, in manipulation environments, the results suggest the opposite: GCIQL is much better than CRL. Does this mean we do need \(Q^*\) in these tasks? Or can they be solved even with \(Q^\beta\) if we use a better behavioral value learning technique than the binary NCE objective in CRL?
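For reference, the two value-learning objectives being contrasted are, roughly (as we understand them; see the IQL/GCIQL and CRL papers for the exact forms):
\[ \mathcal{L}_{\text{GCIQL}} = \mathbb{E}\big[\, |\tau - \mathbb{1}(Q(s, a, g) - V(s, g) < 0)| \, (Q(s, a, g) - V(s, g))^2 \,\big] + \mathbb{E}\big[\, (r(s, g) + \gamma V(s', g) - Q(s, a, g))^2 \,\big], \]
\[ \mathcal{L}_{\text{CRL}} = -\mathbb{E}\big[\, \log \sigma(f(s, a, g^+)) + \log(1 - \sigma(f(s, a, g^-))) \,\big], \]
where \(g^+\) is a future state of \((s, a)\) in the dataset, \(g^-\) is a randomly sampled state, and the learned critic \(f\) is monotonically related to \(Q^\beta\).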
Why can't we use random goals when training policies? When training goal-conditioned policies, we found that it is usually better to sample (policy) goals only from future states in the current trajectory (except on stitch or explore datasets). The fact that this works better even in Scene and Puzzle (which require goal stitching) is a bit surprising, because it means that the policy can still perform goal stitching to some degree even without being explicitly trained on the test-time state-goal pairs. At the same time, it is rather unsatisfying, because this ability to stitch goals entirely depends on the seemingly "magical"🧙 generalization capabilities of neural networks. Is there a way to train a goal-conditioned policy with random goals while maintaining performance, so that it can perform goal stitching in a principled manner?
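To make the two strategies concrete, here is a schematic sketch of future-state vs. random goal sampling for policy training (a hypothetical helper, not the sampler used in the reference implementations; the geometric offset is an illustrative convention):
# Schematic goal sampler for policy training (hypothetical helper, illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def sample_policy_goal(dataset, idx, ep_end, p_random=0.0, discount=0.99):
    # With probability `p_random`, sample a random state from the whole dataset
    # ("random goals"); otherwise sample a future state from the same trajectory,
    # with a geometric(1 - discount) offset ("future goals").
    observations = dataset['observations']
    if rng.random() < p_random:
        return observations[rng.integers(len(observations))]
    offset = rng.geometric(1 - discount)
    return observations[min(idx + offset, ep_end - 1)]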
How can we combine expressive policies with goal-conditioned RL? If we compare the results on cube-single-play and cube-single-noisy, we can see that current offline goal-conditioned RL methods often struggle with datasets collected by non-Markovian policies in manipulation environments. Handling non-Markovian trajectory data is indeed one of the major challenges in behavioral cloning, for which many recent behavioral cloning-based methods have been proposed (e.g., ACT and Diffusion Policy). Can we incorporate these recent advancements in behavioral cloning into offline goal-conditioned RL?
We believe OGBench provides a foundation for answering these research questions, and we hope it leads to performant, scalable offline GCRL algorithms that enable building RL foundation models!
Citation
@article{ogbench_park2024,
title={OGBench: Benchmarking Offline Goal-Conditioned RL},
author={Seohong Park and Kevin Frans and Benjamin Eysenbach and Sergey Levine},
journal={ArXiv},
year={2024}
}
The website template was partly borrowed from Michaël Gharbi and Jon Barron.