Implicit Q-learning, or "I can't believe it's not SARSA": state-of-the-art offline RL results, fast and easy to implement; almost SARSA, but with a different loss to provide "implicit policy improvement": arxiv.org/abs/2110.06169 w/ @ikostrikov, @ashvinair 🧵->
Here is the idea: if we want to prevent *all* OOD action issues in offline RL, we could use *only* actions in the dataset. That leads to a SARSA update, which is very stable. But it learns the *behavior policy* value function, not the optimal value function:
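To make the contrast concrete, here is a minimal toy sketch (not from the paper) of the two TD targets on one logged transition. All names here are illustrative assumptions: `Q` is a small tabular Q-function, and `(s, a, r, s2, a2)` is a dataset transition where `a2` is the next action the behavior policy actually took.

```python
import numpy as np

# Toy sketch contrasting the TD target in a SARSA-style offline update
# with the standard Q-learning target. Tabular setting for illustration.
rng = np.random.default_rng(0)
num_states, num_actions = 5, 3
Q = rng.normal(size=(num_states, num_actions))  # hypothetical Q-table
gamma = 0.99

# One logged transition from the offline dataset (hypothetical values).
s, a, r, s2, a2 = 0, 1, 1.0, 2, 0

# SARSA target: bootstraps only from the dataset action a2, so it never
# queries out-of-distribution actions -- but its fixed point is the
# *behavior policy's* value function, not the optimal one.
sarsa_target = r + gamma * Q[s2, a2]

# Q-learning target: maxes over all actions, including ones the dataset
# never supports -- this is the OOD-action issue offline RL must avoid.
q_learning_target = r + gamma * Q[s2].max()
```

Since the max ranges over every action while SARSA picks out just the logged one, the Q-learning target is always at least as large, and the gap is exactly where OOD overestimation can creep in.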