Implicit Q-learning, or "I can't believe it's not SARSA": state-of-the-art offline RL results, fast and easy to implement; almost SARSA, but with a different loss to provide "implicit policy improvement": arxiv.org/abs/2110.06169 w/ @ikostrikov, @ashvinair 🧵->
Here is the idea: if we want to prevent *all* OOD action issues in offline RL, we could use *only* actions in the dataset. That leads to a SARSA update, which is very stable. But it learns the *behavior policy* value function, not the optimal value function:
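To make the contrast concrete, here is a minimal toy sketch (not from the paper) of the two TD targets on one logged transition. All names here are illustrative assumptions: `Q` is a small tabular Q-function, and `(s, a, r, s2, a2)` is a dataset transition where `a2` is the next action the behavior policy actually took.

```python
import numpy as np

# Toy sketch contrasting the TD target in a SARSA-style offline update
# with the standard Q-learning target. Tabular setting for illustration.
rng = np.random.default_rng(0)
num_states, num_actions = 5, 3
Q = rng.normal(size=(num_states, num_actions))  # hypothetical Q-table
gamma = 0.99

# One logged transition from the offline dataset (hypothetical values).
s, a, r, s2, a2 = 0, 1, 1.0, 2, 0

# SARSA target: bootstraps only from the dataset action a2, so it never
# queries out-of-distribution actions -- but its fixed point is the
# *behavior policy's* value function, not the optimal one.
sarsa_target = r + gamma * Q[s2, a2]

# Q-learning target: maxes over all actions, including ones the dataset
# never supports -- this is the OOD-action issue offline RL must avoid.
q_learning_target = r + gamma * Q[s2].max()
```

Since the max ranges over every action while SARSA picks out just the logged one, the Q-learning target is always at least as large, and the gap is exactly where OOD overestimation can creep in.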