@nlpmattg @johnschulman2 6.3) in contrast, in RLFT, the training process does not provide a new answer that the model didn't first produce by itself. it only re-weights the model's own predictions. so it teaches the model how to more effectively use its internal representation, OR to say it doesn't know.
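[A minimal runnable sketch of the SFT-vs-RLFT contrast drawn here, assuming a toy setup where the "model" is just a logit vector over three candidate answers to a single question; illustrative only, not anyone's actual training code:]

```python
import torch
import torch.nn.functional as F

# Toy "model": logits over three candidate answers [wrong, right, "I don't know"].
# The wrong answer is the mode, but the right answer has non-zero probability.
logits = torch.tensor([1.5, 0.5, 0.0], requires_grad=True)

# SFT-style update: the target (index 1, the right answer) is supplied from the
# outside, regardless of whether the model would ever produce it itself.
sft_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
sft_loss.backward()
print("SFT grad:", logits.grad)   # pushes mass toward the externally-given target

logits.grad = None

# RLFT-style (REINFORCE) update: sample from the model's *own* distribution,
# then re-weight the sampled answer's log-prob by a scalar reward
# (+1 for the right answer or "I don't know", -1 for the confident wrong answer).
dist = torch.distributions.Categorical(logits=logits)
sample = dist.sample()
reward = {0: -1.0, 1: 1.0, 2: 1.0}[sample.item()]
rl_loss = -reward * dist.log_prob(sample)
rl_loss.backward()
print("RLFT grad:", logits.grad)  # only up/down-weights what the model already said
```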
@nlpmattg @johnschulman2 that was the argument, i hope things are clearer. it's not a strictly binary argument; RLFT may also have some failure cases here, but they are not baked into the process the way they are in SFT.
@yoavgo @johnschulman2 Thanks, that's helpful. I think I can rephrase the argument by saying that, if it is your goal to teach the model not to lie, SFT will not do that, while RL can. I buy this argument. I still don't buy the argument about what SFT does.
@yoavgo @johnschulman2 Consider the case where you literally take your SFT data and append it to your PT data. It's the same objective and same optimization parameters. Why is this different? Maybe there are some formatting differences, but there is certainly data like this in pretraining data.
@yoavgo @johnschulman2 I believe this invalidates points 3 and 6.2 in what you laid out (at least the implicit "_only_ in pretraining" in point 3). I'm also not sure I buy 5, at least as it applies to the difference between SFT and PT - for any single fact, how much of a frequency difference is there?
@yoavgo @johnschulman2 And I think 6.3 is wrong, as I said earlier, particularly in the knowledge-seeking case that you are focused on. At least in cases where the model has non-zero probability on the right answer, but the right answer is not the mode.
@nlpmattg @johnschulman2 re SFT, it *can* teach the model new facts, but in this case I argue that it is not doing its job correctly, because you want it to teach behavior, not facts. maybe your argument is that it can teach both facts AND behavior at the same time? i can be convinced of that.
@yoavgo @johnschulman2 Yes, I agree with the point that if you want to teach a behavior with a much cheaper process than PT, you are likely better off with RL than with SFT. My issue is with the characterization of SFT as "teaching the model to lie". I think if that's true, it must also apply to PT.
@nlpmattg @johnschulman2 My argument is: 1) PT adds knowledge. 2) if SFT also adds knowledge, then nothing bad happened, but also nothing good. 3) if SFT teaches behavior, it is likely that some of the behavior it teaches would be to lie. 4) this likelihood of teaching the model to lie is lower in RL
@yoavgo @johnschulman2 I still don't see why 3 holds for SFT but doesn't also apply to 1, unless you are updating much more per example on the FT data than in PT. Which you might be - I said earlier that the optimization parameters are the same, but they might not be, and that could be a key part of the argument.
@nlpmattg @johnschulman2 this is also a good point! but for me everything in my argument boils down to intent. our intent for PT is to make the model learn a lot of stuff. our intent for instruction-tuning (whether via RL or SFT) is to teach the model to follow instructions, not to add new knowledge.
@yoavgo @johnschulman2 Yep, I think we're agreeing at this point. Thanks for the clarification, this was helpful.