𝐃𝐢𝐫𝐞𝐜𝐭 𝐏𝐫𝐞𝐟𝐞𝐫𝐞𝐧𝐜𝐞 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐚𝐭𝐢𝐨𝐧 (𝐃𝐏𝐎): 𝐨𝐧𝐞 𝐨𝐟 𝐭𝐡𝐞 𝐦𝐨𝐬𝐭 𝐞𝐱𝐜𝐢𝐭𝐢𝐧𝐠 𝐫𝐞𝐜𝐞𝐧𝐭 𝐚𝐝𝐯𝐚𝐧𝐜𝐞𝐦𝐞𝐧𝐭𝐬 𝐢𝐧 𝐋𝐋𝐌𝐬

DPO was introduced as an alternative to Reinforcement Learning from Human Feedback (RLHF). RLHF first trains a reward model to capture human preferences, then fine-tunes the LLM with reinforcement learning to maximize that estimated reward without drifting too far from the original model.

DPO, on the other hand, treats alignment as a classification problem and drops the reinforcement learning loop entirely. The authors show that an LLM can be optimized directly on preference data: they exploit an analytical mapping between reward functions and optimal policies, which turns the loss over rewards into a loss over policies. As a result, we train a single policy (the LLM itself) that implicitly captures the reward, instead of a separate reward network.

Mathematically, the DPO loss is a binary cross-entropy (maximum likelihood) objective over preference pairs, where the implicit reward of a response is the log-ratio between the policy and a frozen reference model (a minimal sketch of this loss is included below). During training, its gradient increases the likelihood of the preferred outputs and decreases the likelihood of the dispreferred ones.

The result is a training algorithm that is more stable and less computationally demanding than RLHF, and, most importantly, it appears to be on par with RLHF's performance.

Original paper: arxiv.org/pdf/2305.18290…
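
For the curious, here is a minimal PyTorch sketch of that loss, not the paper's reference implementation. The argument names (policy_chosen_logps, ref_rejected_logps, etc.) are placeholders I'm assuming for the summed token log-probabilities of each response under the trained policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Binary cross-entropy style DPO loss over a batch of preference pairs.

    Each input is a 1-D tensor of log-probabilities (summed over tokens) of the
    chosen (preferred) or rejected (dispreferred) response, under either the
    policy being trained or the frozen reference model. beta controls how far
    the policy is allowed to drift from the reference.
    """
    # Implicit "rewards": log-ratio between the policy and the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid(margin): pushes the chosen response above the rejected one,
    # i.e. raises the likelihood of preferred outputs and lowers dispreferred ones
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note there is no sampling and no reward model rollout here: everything reduces to log-probabilities of already-collected preference pairs, which is where the stability and compute savings over RLHF come from.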