Excited to present our work with @ashvinair and @svlevine: Offline RL with Implicit Q-Learning (IQL), a simple method that achieves SOTA performance on D4RL arxiv.org/abs/2110.06169 and runs 4x faster than the prior SOTA github.com/ikostrikov/imp… Thread below
Actor-critic algorithms can fail in offline RL when the actor outputs out-of-dataset actions for TD backups. What if we just do TD learning with dataset actions? That is very stable, but it learns the behavior policy's value function, while we want the optimal value function.
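In TD-target terms, a minimal sketch of the trade-off (standard notation, with D the offline dataset):

```latex
% SARSA-style backup on dataset actions: stable, but converges to the value of
% the behavior policy that collected the data:
y_{\text{SARSA}} = r(s,a) + \gamma\, Q(s', a'), \qquad (s, a, r, s', a') \sim \mathcal{D}
% Q-learning backup: targets the optimal value, but the max queries the critic
% at actions a' that may lie outside the dataset:
y_{\text{Q-learning}} = r(s,a) + \gamma \max_{a'} Q(s', a')
```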
Instead of training an actor to approximate the maximum of the Q-function, IQL uses expectile regression, which does not require sampling out-of-dataset actions:
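Concretely, the value objective from the paper, where τ ∈ (0.5, 1) is the expectile and a larger τ pushes V(s) closer to the maximum of Q(s, a) over dataset actions:

```latex
% Asymmetric squared loss (expectile regression):
L_2^{\tau}(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^2
% Value objective: V regresses toward an upper expectile of Q using only
% state-action pairs from the dataset D:
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[ L_2^{\tau}\!\big( Q_{\hat{\theta}}(s,a) - V_{\psi}(s) \big) \right]
```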
Expectile regression can be implemented as a simple asymmetric MSE. The resulting estimates of V(s) then serve as targets in the TD backups for Q. Finally, we extract the policy with advantage-weighted behavioral cloning.
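A rough NumPy sketch of the three losses (illustrative placeholder values only, not the actual implementation; the real JAX code is in the repo linked in the first tweet):

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    """Asymmetric MSE: weights positive errors by tau and negative errors by (1 - tau)."""
    weight = np.where(diff > 0, tau, 1 - tau)
    return weight * diff ** 2

# Hypothetical batch quantities from dataset transitions (s, a, r, s') and current networks.
q_sa   = np.array([1.0, 2.0, 0.5])   # Q(s, a) from the (target) critic
v_s    = np.array([0.8, 1.5, 0.7])   # V(s) from the value network
r      = np.array([0.0, 1.0, 0.0])   # rewards
v_next = np.array([0.9, 1.2, 0.4])   # V(s') used for the TD target
gamma, beta = 0.99, 3.0

# 1) Value loss: expectile regression of V(s) toward Q(s, a) -- no actor samples needed.
value_loss = expectile_loss(q_sa - v_s, tau=0.7).mean()

# 2) Critic loss: TD backup that bootstraps from V(s') instead of max_a' Q(s', a').
critic_loss = ((q_sa - (r + gamma * v_next)) ** 2).mean()

# 3) Policy extraction: advantage-weighted behavioral cloning on dataset actions.
advantage = q_sa - v_s
weights = np.exp(beta * advantage)    # clipped (e.g. at 100) in practice
# actor_loss = -(weights * log_prob_of_dataset_actions).mean()
```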
IQL works especially well on the Ant Maze tasks, which require ‘stitching’ (dynamic programming).
Finally, IQL works extremely well for finetuning: pretraining on offline data, then continuing to train online. On the challenging Ant Maze tasks, it performs best offline and then improves even further online.