Llama 3 was trained using intra-document causal masking, as suggested by @yuzhaouoe's paper "Analysing The Impact of Sequence Composition on Language Model Pre-Training"! 🚀🚀🚀 arxiv.org/abs/2402.13991
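For anyone unfamiliar with the technique: when multiple documents are packed into one training sequence, intra-document masking restricts attention so tokens cannot attend across document boundaries. A minimal sketch below (function name, shapes, and the `doc_ids` representation are illustrative assumptions, not Llama 3's actual implementation):

```python
import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build an attention mask for a packed sequence of documents.

    doc_ids: (seq_len,) tensor where doc_ids[i] is the index of the
    document that token i belongs to. A token may attend only to
    earlier tokens (causal) within the same document (intra-document).
    Returns a (seq_len, seq_len) boolean mask; True = attention allowed.
    """
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Example: three documents packed into one 8-token sequence.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
mask = intra_document_causal_mask(doc_ids)
# mask[4, 3] is True  (token 4 attends to token 3, same document)
# mask[4, 2] is False (token 2 belongs to a different document)
```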
@PMinervini @yuzhaouoe Segmentation masks are pretty fundamental stuff that a codebase should have, though...
@PMinervini @yuzhaouoe Nice observation, this is indeed helpful -- but isn't this just a basic feature? I thought NLP researchers were doing this all the time, maybe not as efficiently as we did for Llama 3 in terms of implementation, e.g., using padding :)
@PMinervini @yuzhaouoe It’s wild that this isn’t standard